Graph Mining for APT
Detection
Xitao Wen
1
Problem Scope
• Inputs
– Multi-granularity Heterogeneous
Information Network (HIN)
– Accurate provenance tagging
• Objectives
– Causality Identification
– Connectivity Anomaly Detection
– Behavior Anomaly Detection
2
Heterogeneous Information
Network
• Directed graphs with entities and events
representing temporal information flows
• Features
– Heterogeneity
• Entities: processes/threads, hosts, routers…
Anything else?
• Events: run-on, send-to, read…
– Multi-granularity
• Any examples for each granularity?
– Large scale and dynamic
• Any estimation on scale and velocity?
3
Topic 1: Causality Identification
• Definition
– Given an (anomalous) event, back-propagate in
HIN and find the events that may lead to this
event
• Input
– HIN
– Anomalous event
– Searching parameters (Optional)
• Output
– Subgraph of events that the anomalous event
depends on
4
Causality Identification
• Basic technique
– Graph searching with dynamic pruning heuristics
• Possible parameters/heuristics
– Search depth
– Branch factor
– Granularity
• Challenges
– Combinatorial explosion of search space
– Quality of search results
– Modeling domain-specific pruning parameters
5
Causality Identification
• Clarification questions
– How to determine two events are
related?
– How is HIN stored? What are the query
complexity?
6
Topic 2: Connectivity Anomaly
Detection
• Definition
– An anomalous connection or a missing
connection between two entities
• Input
– Clean information flows (for training)
– Test information flows
• Output
– Anomalous connections or missing
connections in test set
7
Connectivity Anomaly Detection
• Basic Techniques
– Design statistical models to conduct
likelihood estimation
– Unsupervised learning to spot
anomalous connections through analysis
of clusters
– (They are entirely beyond my
knowledge base…)
8
Topic 3: Behavior Anomaly
Detection
• Definition
– Identify frequent behavior patterns (i.e.
subgraphs) and find instances that are
similar but deviate slightly from the pattern
• Input
– A large collection individual activity
subgraphs
• Output
– Behavior patterns
– Anomalous instances
9
Behavior Anomaly Detection
• Basic Technique
– Frequent pattern mining in graphs (a decades-long research topic)
• Challenges
– Scalability
– Deal with noise in data
– Differentiating benign deviations and malicious deviations
– Quantifying the severity of anomalies
• References cited in the proposal
– William Eberle and Lawrence B. Holder. Anomaly detection in data
represented as graphs. Intell. Data Anal., 07.
– William Eberle and Lawrence B. Holder. Discovering structural
anomalies in graph-based data. In MGCS ’07
– William Eberle and Lawrence B. Holder. Applying graph-based
anomaly detection approaches to the discovery of insider threats.
ISI ’09
10
More Previous Work on Frequent
Subgraph Pattern Mining
• Approaches
– Apriori-based approach:
• AGM by Inokuchi et al.
• FSG by Kuramochi and Karypis,
• an edge-disjoint path-join algorithm by Vanetik et al.
– Pattern-Growth Approach
• gSpan by Yan and Han
• MoFa by Borgelt and Berthold
• FFSM by Huan et al.
• SPIN by Huan et al.
• Gaston by Nijssen and Kok
• More scalable approaches
– Significant subgraph
– Representative subgraph
– Dense subgraph
• Fundamental challenge
– Combinatorial explosion: search space grows exponentially with the
pattern size
11
12