Discovering Exfiltration Paths Using Reinforcement Learning
Abstract—Reinforcement learning (RL) operating on attack graphs that leverage cyber terrain principles is used to develop the reward and state formulations associated with determining surveillance detection routes (SDR). This work extends previous efforts on developing RL methods for path analysis within enterprise networks. It focuses on building SDR in which the routes explore network services while trying to evade risk. RL is utilized to support the development of these routes through a reward mechanism that helps realize such paths. The RL algorithm is modified with a novel warm-up phase which decides, during initial exploration, which areas of the network are safe to explore based on the rewards and the penalty scale factor.

Index Terms—attack graphs, reinforcement learning, surveillance detection routes, SDR, cyber terrain

I. INTRODUCTION

Reconnaissance (also called recon) in MITRE's Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK®) framework is described as "techniques that involve adversaries actively or passively gathering information that can be used to support targeting [1]." As reconnaissance activities usually precede an exploitation campaign, detection of these efforts benefits cyber defenders by identifying potential targets (e.g., crown jewels) of interest. In this respect, adversarial recon activities strive to maximize visibility of targets while minimizing opportunities of being detected. Critical to this is the identification of paths, termed SDR, traversed by adversaries to gather critical data about targets (e.g., ports, protocols, applications, services).

From a cyber protection standpoint, recon activities disguise serious hostile intent but may be quite difficult to quantitatively differentiate from normal behavior. Malicious intent is difficult to observe, as it may be efficiently designed to hide in the background of normal traffic. Detecting this type of recon is situational in nature and requires meticulous analyses of huge volumes of collected data. Domains apart from cyber are challenged in a similar manner, where evaluation of such SDR requires data analysis to differentiate abnormal traffic from events that are suspicious in proximity to roads and crossings [2].

Modern efforts to detect and respond to adversarial network reconnaissance are a complicated blend of automated and human processes. Automated collection systems are installed on network devices and endpoints to passively and actively monitor network communications, analyze the flow, and aggregate the data for Security Information and Event Management (SIEM) and/or Security Orchestration, Automation and Response (SOAR) systems to analyze. These network tools assist the human component of detection by providing automated security reports and incident alerts, and by executing network protection protocols with a single click from the security operations center (SOC) analyst. The effectiveness of these systems relies on the data collected, the knowledge of current threat behavior, and the human analyst's ability to understand the threat. Naturally, such approaches have blindspots.

Combining current security information (network topology and configuration) with machine learning (ML) analysis allows highlighting weak points missed by automated systems that an attacker may focus on during initial recon. Network traffic behavior analysis, no matter how advanced, relies on active network traffic and does not preempt network, host, or protocol misconfigurations. This paper contributes a deep RL approach to generating SDR in the form of attack graphs from network models consisting of network topology and configuration, thereby extending the suite of automated tools and systems available for cyber defense.

In particular, a Markov decision process (MDP) formulation and a new algorithm, SDR-RL, are proposed that use a warm-up phase and penalty scaling to control the asymmetry between the number of host services scanned and the amount of defensive terrain encountered. This emulates the asymmetry sought by human operators when conducting network reconnaissance generally and SDR in particular. Additionally, this paper extends the double agent architecture (DAA) of Nguyen et al. [3], which originally used standard deep Q-learning (DQN), with actor-critic (A2C) and proximal policy optimization (PPO) algorithms [4].

This paper is structured as follows. First, background on RL and penetration testing is given. Then, the methods used to expose SDR using RL, attack graphs, and cyber terrain are presented. Next, the experimental design for testing the presented methodology is given, including details regarding RL implementation and training. Results are presented, and, before concluding, an in-depth discussion is given in terms of cyber-specific outcomes.
II. RL AND PENETRATION TESTING

A. Reinforcement Learning Preliminaries

Reinforcement learning (RL) problems involve an agent interacting with an environment, transiting from one state to another until it reaches the goal state [5]. The agent receives rewards for taking actions, with the overall goal of achieving maximum cumulative reward. Environments are normally modeled as MDPs, which can be defined by a tuple {S, A, R, P, γ}, where S denotes the set of possible states, A denotes the set of possible actions, R represents the distribution of reward given any state-action pair, P represents the transition probability, and γ is the discount factor. The goal of the agent is to learn an optimal policy π that maps states to actions. That is, the RL algorithm learns an optimal policy that selects actions so as to maximize the expected cumulative discounted reward:

$\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_{t>0} \gamma^t r_t \mid \pi\right]$,  (1)

with $s_0 \sim p(s_0)$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.

In (deep) Q-learning, the optimal policy can be defined by two terms, the value function and the Q-value function. The value function shows how good a state is and is defined as the expected cumulative reward from following the policy from that state:

$V^{\pi}(s) = \mathbb{E}\left[\sum_{t>0} \gamma^t r_t \mid \pi\right]$.  (2)

The Q-value function, on the other hand, is defined as the expected cumulative reward given both parts of the state-action pair:

$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t>0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$.  (3)

It satisfies the Bellman equation

$Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$.  (4)

Thus, the optimal policy π* corresponds to taking the best action in any state as specified by Q*.

For more complex problems, where the state or action space is large enough that computing every state-action pair is infeasible, neural networks become a powerful function estimator in deep Q-learning (DQL) [6], [7]. The optimal Q-values are estimated by a neural network parameterized by θ:

$Q(s, a; \theta) \approx Q^*(s, a)$,  (5)

where $y_i = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a'; \theta_{i-1}) \mid s, a\right]$ is the target. During the backward pass, the gradients can be calculated as

$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim p(\cdot);\, s'}\left[\left(r + \gamma \max_{a'} Q^*(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$.  (6)
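As a concrete illustration of the update in Eq. (6), the following is a minimal PyTorch-style Python sketch of the one-step temporal-difference loss whose gradient matches Eq. (6); the network interface, batch layout, and hyperparameters are illustrative assumptions rather than the implementation used in this work.

    import torch
    import torch.nn.functional as F

    def dqn_td_loss(q_net, target_net, batch, gamma=0.99):
        """One-step TD loss for deep Q-learning; its gradient mirrors Eq. (6)."""
        states, actions = batch["states"], batch["actions"]
        rewards, next_states, dones = batch["rewards"], batch["next_states"], batch["dones"]

        # Q(s, a; theta_i) for the actions actually taken.
        q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

        # TD target y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}), held fixed.
        with torch.no_grad():
            q_next = target_net(next_states).max(dim=1).values
            target = rewards + gamma * (1.0 - dones) * q_next

        # Minimizing this squared error produces the gradient of Eq. (6).
        return F.mse_loss(q_sa, target)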
B. The Penetration Testing Environment

Though RL has recently been pursued as a tool for penetration testing, the approaches to modeling the network vary significantly. Alternative methods for modeling penetration testing include hypothesis generation, ontology-based models, attack trees, and attack graphs. The hypothesis generation model [8] is an organizational network representation for cyber defense, while ontology-based models [9] focus more on the semantics of penetration testing. However, neither of them contains any structural information about the network itself. While both attack trees [10] and attack graphs provide structural representations of the network, attack graphs are more general, and attack trees are special cases of attack graphs. As a result, attack graphs are the most recognized method for modeling the penetration testing environment [11].

C. Related Work

The use of deep RL for attack graphs has seen recent development. Other than Ghanem and Chen [12], the authors in the RL-for-attack-graphs literature use fully observable MDPs to model networks. Many authors use the Common Vulnerability Scoring System (CVSS) to furnish their MDPs (CVSS-MDPs). Yousefi et al. provide the earliest work doing so in deep RL for penetration testing [13]. Hu et al. extend the use of the CVSS by proposing to use exploitability scores to weight rewards [14]. Gangupantulu et al. [15], [16] and Cody et al. [17] explicitly extend the methods of Hu et al. with concepts of terrain. Gangupantulu et al. advocate defining models of terrain in terms of the rewards and transition probabilities of MDPs, first in the case of firewalls as obstacles [15], then in the case of lateral pivots near key terrain [16]. Cody et al. apply these concepts to exfiltration [17]. Other authors either handcraft the MDP or do not remark on how its components are estimated.

Many authors apply generic deep Q-learning (DQN) [6], [7] to solve point-to-point network traversal tasks [13]–[15], [18], [19]. Typically the terminal state is unknown and solutions take the form of individual paths. Others develop domain-specific modifications for deep RL, including the double agent architecture [3], a hierarchical action decomposition approach [20], and various improvements to DQN termed NDSPI-DQN [21]. Another line of research focuses on developing more specific penetration testing tasks. A number of authors define such tasks by reward engineering and other modifications to the MDP, including formulations of capture the flag [22], crown jewel analysis [16], and discovering exfiltration paths [17]. This paper extends this line of research with a methodology for exposing SDR.
III. METHODS

The following subsections describe the presented methods for adding service-based risk penalties as defensive terrain in CVSS-MDPs and the algorithm for discovering surveillance paths in a network.

A. Defensive Terrain in CVSS-MDPs

Gangupantulu et al. [15] proposed that cyber terrain can be modeled in CVSS-MDPs by adding transition probabilities for traversing firewalls and negative rewards for different protocols. Cody et al. [17] later modeled service-based defensive terrain in CVSS-MDPs based on the assumption that attackers can infer the presence of defensive terrain from the services running on a host. We adopt their methods and classify the services into four categories: authentication, data, security, and common. Our reward hierarchy assigns -6 for authentication, -4 for data, and -2 for security and common services. Moreover, the type of action affects the reward: n+1 (-1, -3, and -5) is assigned for a scanning action, while n (-2, -4, and -6) is assigned for an exploiting action.
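For concreteness, this reward hierarchy can be expressed as a small lookup, sketched below in Python. The category labels and function interface are illustrative assumptions; the numbers follow the scheme just described, and treating the penalty scale factor as a multiplier on the penalty is an assumption that anticipates the scaling used later in the sensitivity analysis.

    # Illustrative encoding of the service-based defensive terrain penalties.
    EXPLOIT_PENALTY = {
        "authentication": -6,
        "data": -4,
        "security": -2,
        "common": -2,
    }

    def terrain_penalty(service_category, action, scale=1.0):
        """Defensive terrain reward for acting on a service of a given category.

        Scanning is penalized one point less than exploiting (n+1 vs. n),
        and the result is multiplied by the penalty scale factor.
        """
        base = EXPLOIT_PENALTY[service_category]
        if action == "scan":
            base += 1            # e.g., -6 -> -5, -4 -> -3, -2 -> -1
        return scale * base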
B. Discovering Surveillance Paths with RL

In our surveillance detection routes (SDR) algorithm (Algorithm 1), we have a target node of interest for which we want to explore all the services running on it. The goal of SDR is to gain the service information of the target node while also maximizing service information discovery in other areas of the network, being cautious and not triggering any defensive terrain. We give a discovery reward of 100 for the target node when all of its services have been discovered. To encourage service discovery, the agent receives a +1 reward for each service discovered on a node. The algorithm is divided into two phases: a warm-up phase and a training phase.
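The discovery reward just described can be summarized by a short check, sketched in Python below; the bookkeeping of discovered services as sets is an assumption made for illustration, not the environment's actual data structures.

    def discovery_reward(node, newly_discovered, discovered, all_services, target_node):
        """+1 for each newly discovered service on a node, plus the 100-point
        discovery reward once every service on the target node is known."""
        reward = len(newly_discovered)                 # +1 per new service
        discovered[node] |= newly_discovered
        if node == target_node and discovered[node] == all_services[node]:
            reward += 100                              # target fully explored
        return reward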
C. Warmup Phase

In the warmup phase, we update the goal of the RL agent so that it gains information not only about the target node but also about other nodes from which we receive positive reward, indicating that the defense terrain allows us to access that area. The following steps are taken in the warmup phase:
• We define a certain number of episodes for the warmup phase in the training configuration.
• In the warmup episodes the RL agent does not learn (no weight updates) but only monitors positive reward during an episode.
• If the RL agent receives a positive reward from a scanning action, then that node is added to the goal along with the target node, and its reward for compromise is set to 100.
• The algorithm allows only one node (the one that gives the maximum positive reward) from each subnet to be part of the goal, as gaining control of one node in a subnet is enough to gain service information for the entire subnet.

At the end of the warmup phase, we have these dynamic nodes along with the target node, all of whose service information must be gained to reach the goal. The number of dynamic nodes that are part of the goal decreases as the scale value of the defense terrain increases. Therefore, we limit the agent's exploration capability as we increase the scale value of the defense terrain. The overall procedure is summarized in Algorithm 1.

Algorithm 1: SDR via RL (SDR-RL)
  input : MDP M, initial node i, set of target nodes J, number of warmup episodes Nw, RL algorithm fRL : M × i × J → path
  output: SDR that includes initial and target nodes
  for n in range(Nw) do
      for each subnet MS in M do
          for each node j in MS do
              if reward_j > 0 then
                  dynamic_nodes ← dynamic_nodes ∪ {j}
              end
          end
          if dynamic_nodes ≠ ∅ then
              J ← max(dynamic_nodes) ∪ J
          end
      end
  end
  path ← fRL(M, i, J)
  return path
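A minimal Python sketch of this warm-up logic is given below. The environment interface (episodes yielding per-node scan rewards grouped by subnet, and a hook to set a node's compromise reward) is an illustrative assumption rather than the interface of the simulator used in this work.

    def warmup(env, target_nodes, num_warmup_episodes):
        """Warm-up phase of SDR-RL: keep, for each subnet, the node whose scan
        produced the largest positive reward and add it to the goal set.
        No policy updates are performed during these episodes."""
        goal = set(target_nodes)
        best = {}                                   # subnet -> (reward, node)
        for _ in range(num_warmup_episodes):
            for subnet, node, reward in env.random_scan_episode():
                if reward > 0 and reward > best.get(subnet, (0.0, None))[0]:
                    best[subnet] = (reward, node)   # one dynamic node per subnet
        dynamic_nodes = {node for _, node in best.values()}
        for node in dynamic_nodes:
            env.set_compromise_reward(node, 100)    # treat it like a goal node
        return goal | dynamic_nodes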
D. Training Phase

In the training phase, the agent interacts with the network in an episodic fashion to learn the best possible path for gaining the service information of all the target nodes identified during the warmup phase.

E. Scalability

Conventionally, DQN is the most basic algorithm used for modeling RL on penetration testing. Nguyen et al. proposed a method that utilizes two A2C agents: one, called the structuring agent, learns the structural information about subnets, hosts, and firewalls as well as the connections between them; the other, called the exploiting agent, selects the actions to take and their targets. Their method solves the scalability problem of DQN to some extent and has a better capacity for large networks. We improve upon this method by applying proximal policy optimization (PPO) instead of A2C for both agents. PPO is an advanced RL algorithm known for its convergence speed, stability, and sample efficiency. It optimizes the following clipped surrogate objective function to prevent the performance collapse caused by a large policy update:

$L(\theta) = \mathbb{E}\left[\min\left(\rho_t(\theta) A_t,\ \mathrm{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right) A_t\right)\right]$,  (7)

where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio of the new policy over the old policy. The advantage function $A_t$ is often estimated using generalized advantage estimation [23], truncated after T steps:

$A_t \approx \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\delta_{T-1}$,  (8)

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.  (9)
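The clipped objective in Eq. (7) and the truncated advantage estimate in Eqs. (8) and (9) can be written compactly as follows. This is a generic PyTorch-style sketch under assumed tensor shapes, not the exact DA-PPO training code.

    import torch

    def gae(rewards, values, gamma=0.99, lam=0.95):
        """Truncated generalized advantage estimation, Eqs. (8)-(9).
        `values` carries one extra entry for the bootstrap value V(s_T)."""
        T = len(rewards)
        adv = torch.zeros(T)
        running = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]   # Eq. (9)
            running = delta + gamma * lam * running                  # Eq. (8)
            adv[t] = running
        return adv

    def ppo_clip_loss(new_logp, old_logp, adv, eps=0.2):
        """Clipped surrogate objective of Eq. (7), negated for gradient descent."""
        ratio = torch.exp(new_logp - old_logp)                       # rho_t(theta)
        unclipped = ratio * adv
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
        return -torch.min(unclipped, clipped).mean()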
IV. EXPERIMENTAL DESIGN

In the following subsections the network, state-action space, and RL algorithm implementation are described.

A. Network Description

The same network framework as in Cody et al. [17] is used for our experiment, but with different configurations of the defense mechanisms. To simulate real-world network conditions, there are layers of defenses between the Internet and the innermost private network. Systems that require the Internet to provide service (HTTP, email) are the most vulnerable to attack and are typically in the Demilitarized Zone (DMZ), with limited access to private network resources. The private, internal networks are separated from the DMZ by firewalls that apply rules allowing connections only to specific internal servers and services. VPN services to the internal network are protected with VPN management firewalls that apply network rules allowing only authorized and authenticated user traffic to traverse internally over an encrypted connection. Internal network subnets are separated based on access rules and allow traffic to egress to the Internet if it is authorized, as well as applying rules for network traversal between subnets. Finally, access to the innermost subnet is controlled by a firewall that allows only authorized traffic in or out, and only to specific hosts.

These security controls were known and applied by the designers but were intentionally left unknown to the model to provide an accurate simulation of an attacker exploring an unknown network environment.
B. Environment Description

The hosts are each represented as a 1D vector that encodes their status (compromised, reachable, discovered or not) and configuration (address, services, OS, and processes). Our environment then combines the vectors of all hosts in the network into a single state tensor. Our action space contains 6 primitive actions for scanning, exploiting, and privilege escalation.

For these experiments, the initial host is set at (1, 0) while the terminal host is set as (3, 1), (8, 2), or (9, 2). Here, the host (x, y) refers to the host indexed by y in subnet x. The initial host is compromised, reachable, and discovered by default so that the agent can perform further exploration; the simulation thus assumes the attacker has already gained this foothold in the network. The goal of SDR is to explore all the services of the target host, and a high positive reward (100) is assigned if the goal is reached.
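As an illustration of the host encoding described above, a vector for a single host might be assembled as follows. The specific fields, their ordering, and the one-hot/multi-hot choices are assumptions for illustration; the text only lists the status and configuration attributes that are encoded.

    import numpy as np

    def encode_host(host, service_list, os_list, process_list):
        """Illustrative 1D host encoding: status flags followed by address and
        multi-hot configuration fields (services, OS, processes)."""
        status = [float(host["compromised"]), float(host["reachable"]), float(host["discovered"])]
        address = [float(host["subnet"]), float(host["host_id"])]
        services = [1.0 if s in host["services"] else 0.0 for s in service_list]
        os_flags = [1.0 if o == host["os"] else 0.0 for o in os_list]
        procs = [1.0 if p in host["processes"] else 0.0 for p in process_list]
        return np.array(status + address + services + os_flags + procs, dtype=np.float32)

    # The environment state is then the stack of all host vectors, e.g.
    # state = np.stack([encode_host(h, SERVICES, OSES, PROCESSES) for h in hosts])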
C. RL Implementation

The experiment is conducted on the single-agent A2C model and two variants of the double agent (DA) architecture [3]. In the original double agent model, DA-A2C, both the structuring and the exploiting agents are trained using the A2C algorithm. To achieve better sample efficiency and training stability, we combine the PPO algorithm [4] with the DA architecture and build a DA-PPO model, where both RL agents are trained using PPO. We use Adam as the optimizer of our networks. The A2C model is trained for 4000 episodes and the two double agent models are trained for 8000 episodes, with a maximum of 3000 steps in each episode. The episode terminates either when the goal is reached or when the step limit is reached. Both the A2C model and the structuring agent of the DA models use deep neural networks (DNNs) with three fully connected layers of size 64, 32, and 1, and the exploiting agent of the DA models uses a DNN with two fully connected layers of size 10 and 1. All DNNs use tanh activation functions for non-output layers and softmax for the output layer.
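A PyTorch-style sketch of such a network builder is given below. The input and output dimensions depend on the state encoding and action space and are therefore left as parameters; treat this as a reading of the reported layer sizes and activations rather than the authors' code.

    import torch.nn as nn

    def fc_net(input_dim, hidden_sizes, output_dim):
        """Fully connected network with tanh hidden activations and a softmax
        output layer, following the description in the text."""
        layers, prev = [], input_dim
        for width in hidden_sizes:
            layers += [nn.Linear(prev, width), nn.Tanh()]
            prev = width
        layers += [nn.Linear(prev, output_dim), nn.Softmax(dim=-1)]
        return nn.Sequential(*layers)

    # Reported sizes: layers of 64, 32, and 1 for the A2C model and the
    # structuring agent, and layers of 10 and 1 for the exploiting agent.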
D. Sensitivity Analysis

We train the A2C, DA-A2C, and DA-PPO models to convergence. Our surveillance detection algorithm is run with four different scale values, namely 1, 3, 5, and 11, for hosts (9,2), (8,2), and (3,1). For hosts (9,2) and (8,2) we run an extra scale of 15. The scale values were experimental in nature, and each scale defines a certain drop in exploration of the network. When the scale value increases, the agent becomes more risk-averse and thus keeps reducing the amount of exploration.

V. RESULTS

Model convergence is showcased by plotting the steps and reward versus episodes; results are shown in Figure 1. As can be seen from Figure 2, DA-PPO trains significantly faster than DA-A2C on all five penalty scales, both in terms of wall time and number of episodes. Specifically, DA-PPO converges in less than ten minutes compared to more than 30 minutes for DA-A2C (Figure 2a), and it requires far fewer episodes to learn an effective policy (Figure 2b).

The DAA modeling highlights a pattern in which the expected path is taken at all penalty scales for all three different target hosts, minimizing the number of defenses crossed. While the path converges after a relatively higher number of episodes, it thus offers higher scalability when applied to real-world use cases with larger networks.

Across the models, it is observed that as the penalty keeps increasing, the agent reduces the amount of exploration and explores only areas which are deemed to be safe. The simpler A2C model follows the expected path at penalty scales of 1, 11, and 15 for the three different target hosts. However, for scales of 3 and 5 with target host (3,1), the A2C model takes a potentially imperfect path which exposes it to higher risk; this suggests that even for a small test network of this size, a simpler agent scheme fails to find realistic paths.

Fig. 1: Training performance of the DAA and A2C agents with different penalty scales: (a) episode reward of the A2C agent; (b) episode steps of the A2C agent; (c) episode reward of the DAA agent; and (d) episode steps of the DAA agent.

Fig. 2: Training performance of DA-A2C and DA-PPO with different penalty scales. The left plot shows the total rewards in an episode as training time increases. The right plot shows the total number of steps in each training episode.
VI. DISCUSSION

We can observe the paths taken by the agent to achieve the SDR goal when the target hosts are (8,2) and (3,1) for penalty scales 1, 3, and 11 in Fig. 3. In Table I, we can observe that as we adjust the scale factor, the number of services being explored drops off. Furthermore, the number of high penalty hosts controlled goes down from 2 to 1 as we scale the penalty, showcasing that the agent becomes more risk averse.

TABLE I: Number of services and hosts explored with different target hosts and scale factors.

For penalty scale 1, the agent takes a path which, according to the environment and penalty, is acceptable: in its exploration of the network it takes control of hosts which are either low or medium risk except for one high penalty host (i.e., (4,0)), which is a necessity since at least one high-risk host needs to be controlled to reach the goal. Therefore, the path taken by the agent for a low penalty of 1 is acceptable.

For penalty scale 3, when the target is (8,2) the agent takes an acceptable path according to the environment and penalty, as it takes control of only low penalty hosts except for one high penalty host (i.e., (4,0)), which is again a necessity to explore services on the target host. However, when the target host is (3,1), the agent takes a path which is less acceptable according to the environment and penalty, as it takes control of two high penalty nodes (i.e., (4,0) and (3,0)) where only one was required to achieve the goal. The agent had the option to select (3,2), which is a lower penalty host than (3,0).

For penalty scale 11, the agent takes a path which is acceptable according to the state and penalty of the environment and takes control of hosts which are low risk, apart from only one high risk host which is unavoidable in achieving the goal. Associating penalty scores (low to high) with the risk aversion of human actors demonstrates that the agent performs actions that are reasonably in line with expected human adversary behavior.

Actors operating on a condensed timeline, commonly referred to as "smash and grab" operations, or actors without sufficient experience, would be examples of an agent set with a penalty scale of 1.0. Nation-state actors, highly competent APTs (Advanced Persistent Threats), and experienced actors with more time to observe and a higher cost of getting caught would be representative of an agent with a penalty scale of 11.0.

When risk aversion is low, an actor is not worried about performing "noisy" scans. These scans will include enumeration of multiple network services or simultaneously traversing multiple network segments (with each new segment potentially having deeper layers of defense incorporated into it). As observed when the penalty score is set to one (1.0), the agent finds and explores paths that involve a higher level of risk of getting caught and a lower sensitivity to negative consequences. These actions involve scanning or traversing paths with a reasonable presumption of greater security and more rigorously logged and monitored network devices, such as VPNs and firewalls (FW) (high risk hosts), as well as enumerating networks not directly associated with the respective target. The risk-accepting paths also explore several networks unnecessarily and utilize multiple services along the way. Each service used and/or exploited along the way creates another risk and/or protection system that may be in place, thus increasing the log presence and actor footprint. (As stated above, it is presumed that firewalls are monitored at a higher rate, and that security services have the most inherited controls.) While exploitation of these devices yields great impact, the associated risk is also higher. This risky behavior is reflected by the agent, as seen in Table I with the larger number of services and high penalty hosts being exploited, as well as represented by the paths chosen in Figure 3.

When risk aversion is high, the cost of getting caught outweighs the reward of network discovery, so actors, and as observed, the agent, take the most direct path possible. Risk-averse actors naturally choose paths that avoid traversing multiple protocols and services and presumed increased security controls. This lowers the log footprint and network noise generated and reduces the chances of detection. The agent's behavior of taking control of only one high risk host to reach the lower target reflects the lowest risk path that an adversary would naturally gravitate to.
Fig. 3: Network diagram showing the SDR to different target nodes for various penalty scale factors: host (3,1) for the first row and (8,2) for the second row. The nodes in green indicate hosts whose service information is available, while that of the nodes in red is not. The nodes in purple are set as initial nodes and the nodes in yellow are set as target nodes.
VII. CONCLUSION

In this work, we provide an RL realization of automating analysis methods for SDR. Our approach introduces a warm-up phase to perform pre-exploration, emulating an experienced actor's need to find the "safe" areas of a network. The capability and efficiency of our methods are validated by simulations on a custom network configured with defense mechanisms (a simulated cyber range).

The model ran on a network created to simulate the realistic placement of layered network defenses, with increasing security controls as an actor draws closer to the most sensitive information and services. As the model was faced with penalties that increased when transitioning into more secure zones, the effect of the penalty score became more demonstrative. This activity mirrors the expected risk sensitivity of an actor matching the risk profile of the model's penalty score. A more risk/penalty-averse model showed significantly more concise paths to reach the goal, while the less risk-averse model explored around and left more of a footprint on the defense system's logs.

In future work, we expect to run this model on a live network utilizing both internal and external network data. Incorporating internal and external scan information allows for a model that most closely mimics a malicious actor and defensive expectations for their actions. Using live data from a production network should also create greater granularity for the penalty scoring, as vulnerability data for CVSS3 and weakness data for the associated Common Weakness Enumeration (CWE) can be incorporated.
REFERENCES

[1] (2021) MITRE ATT&CK® framework. [Online]. Available: https://attack.mitre.org
[2] R. Schoemaker, R. Sandbrink, and G. van Voorthuijsen, "Intelligent route surveillance," in Unattended Ground, Sea, and Air Sensor Technologies and Applications XI, E. M. Carapezza, Ed., vol. 7333, International Society for Optics and Photonics. SPIE, 2009, pp. 83–90.
[3] H. V. Nguyen, S. Teerakanok, A. Inomata, and T. Uehara, "The proposal of double agent architecture using actor-critic algorithm for penetration testing," in ICISSP, 2021, pp. 440–449.
[4] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[7] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[8] C. Weissman, "Penetration testing," Information Security: An Integrated Collection of Essays, vol. 6, pp. 269–296, 1995.
[9] G. Chu and A. Lisitsa, "Ontology-based automation of penetration testing," in ICISSP, 2020, pp. 713–720.
[10] B. Schneier, "Attack trees," Dr. Dobb's Journal, vol. 24, no. 12, pp. 21–29, 1999.
[11] T. Cody, "A layered reference model for penetration testing with reinforcement learning and attack graphs," arXiv preprint arXiv:2206.06934, 2022.
[12] M. C. Ghanem and T. M. Chen, "Reinforcement learning for efficient network penetration testing," Information, vol. 11, no. 1, p. 6, 2020.
[13] M. Yousefi, N. Mtetwa, Y. Zhang, and H. Tianfield, "A reinforcement learning approach for attack graph analysis," in 2018 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/12th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE). IEEE, 2018, pp. 212–217.
[14] Z. Hu, R. Beuran, and Y. Tan, "Automated penetration testing using deep reinforcement learning," in 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). IEEE, 2020, pp. 2–10.
[15] R. Gangupantulu, T. Cody, P. Park, A. Rahman, L. Eisenbeiser, D. Radke, and R. Clark, "Using cyber terrain in reinforcement learning for penetration testing," Submitted to ACM ASIACCS 2022, 2021.
[16] R. Gangupantulu, T. Cody, A. Rahman, C. Redino, R. Clark, and P. Park, "Crown jewels analysis using reinforcement learning with attack graphs," arXiv preprint arXiv:2108.09358, 2021.
[17] T. Cody, A. Rahman, C. Redino, L. Huang, R. Clark, A. Kakkar, D. Kushwaha, P. Park, P. Beling, and E. Bowen, "Discovering exfiltration paths using reinforcement learning with attack graphs," arXiv preprint arXiv:2201.12416, 2022.
[18] J. Schwartz and H. Kurniawati, "Autonomous penetration testing using reinforcement learning," arXiv preprint arXiv:1905.05965, 2019.
[19] A. Chowdhary, D. Huang, J. S. Mahendran, D. Romo, Y. Deng, and A. Sabur, "Autonomous security analysis and penetration testing," in 2020 16th International Conference on Mobility, Sensing and Networking (MSN). IEEE, 2020, pp. 508–515.
[20] K. Tran, A. Akella, M. Standen, J. Kim, D. Bowman, T. Richer, and C.-T. Lin, "Deep hierarchical reinforcement agents for automated penetration testing," arXiv preprint arXiv:2109.06449, 2021.
[21] S. Zhou, J. Liu, D. Hou, X. Zhong, and Y. Zhang, "Autonomous penetration testing based on improved deep Q-network," Applied Sciences, vol. 11, no. 19, p. 8823, 2021.
[22] F. M. Zennaro and L. Erdodi, "Modeling penetration testing with reinforcement learning using capture-the-flag challenges: trade-offs between model-free learning and a priori knowledge," arXiv preprint arXiv:2005.12632, 2020.
[23] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," arXiv preprint arXiv:1506.02438, 2015.