Machine Learning Bro Ids
Machine Learning, Bro and You!
Pandas DataFrame with all the right types and timestamp as index
What’s the intended audience?
• People who like Python
• Interested in Pandas, scikit-learn, Spark, Parquet
• Hate seeing examples on Iris data or TF-IDF
• Frustrated when trying to use your own data
• Want easy examples using Bro!
Are you going to show super scalable blah?
• Presentation will talk about Pandas, Scikit-Learn
• We also have classes/notebooks on:
• Kafka
• Parquet
• Spark
• We’ll show some of this stuff…
● Big Picture
● Software Bridges (BAT)
○ Bro to Python
○ Bro to Pandas
○ Bro to Scikit-Learn
● Example: Anomaly Detection
○ Bro DNS and HTTP logs
○ Categorical and Numeric Data
○ Clustering
○ Isolation Forests
What is BAT?
Bro Analysis Tools
A simple-to-use Python module that makes getting Bro data into popular data analysis and ML packages super easy!
$ pip install bat
https://github.com/Kitware/bat
Who’s Kitware?
● ~130 people, offices around the world
● Developing and supporting open
source software for 25 years
● New information security program
● Summer Internships available :)
Talk Outline
You guys haven't seen my rabbit, have you?
● Big Picture
● Software Bridges
○ Bro to Python
○ Bro to Pandas
○ Bro to Scikit-Learn
● Example: Anomaly Detection
○ Bro DNS and HTTP logs
○ Categorical and Numeric Data
○ Clustering
○ Isolation Forests
Hello World
Step 1: $ pip install bat
Step 2: Write a few lines of code
Step 3: There is no step 3...

    from pprint import pprint
    from bat import bro_log_reader

    # Run the bro reader on a given log file
    reader = bro_log_reader.BroLogReader('dhcp.log')
    for row in reader.readrows():
        pprint(row)
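To make the idea behind a log reader concrete, here is a minimal sketch that parses Bro's tab-separated ASCII log format using only the standard library. It is not BAT's implementation: the sample log contents (a trimmed, made-up dhcp.log), the `read_rows` helper, and the `CONVERTERS` table are all assumptions for illustration.

```python
import io
from datetime import datetime, timezone

# A tiny, made-up dhcp.log in Bro's tab-separated format (fields trimmed for brevity).
SAMPLE_LOG = "\n".join([
    "#separator \\x09",
    "#set_separator\t,",
    "#empty_field\t(empty)",
    "#unset_field\t-",
    "#path\tdhcp",
    "#fields\tts\tuid\tid.orig_h\tid.orig_p\tassigned_ip\tlease_time",
    "#types\ttime\tstring\taddr\tport\taddr\tinterval",
    "1500000000.000000\tCabc123\t192.168.1.10\t68\t192.168.1.50\t3600.000000",
    "1500000060.500000\tCdef456\t192.168.1.11\t68\t-\t-",
    "#close\t2017-07-14-02-41-00",
])

# Hypothetical mapping from Bro type names to Python converters.
# Any type not listed here is kept as a plain string.
CONVERTERS = {
    "time": lambda v: datetime.fromtimestamp(float(v), tz=timezone.utc),
    "interval": float,
    "double": float,
    "port": int,
    "count": int,
    "int": int,
}


def read_rows(handle, unset="-"):
    """Yield one dict per data line, typed according to the #types header."""
    fields, types = [], []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]
        elif line.startswith("#types"):
            types = line.split("\t")[1:]
        elif line.startswith("#") or not line:
            continue  # other header/footer lines carry no row data
        else:
            row = {}
            for name, kind, value in zip(fields, types, line.split("\t")):
                # Unset fields become None; everything else is converted by type.
                row[name] = None if value == unset else CONVERTERS.get(kind, str)(value)
            yield row


rows = list(read_rows(io.StringIO(SAMPLE_LOG)))
for row in rows:
    print(row)
```

Each row comes back as a plain dictionary keyed by the names in the `#fields` header, which is the same shape the Hello World loop prints.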
● Big Picture
● Software Bridges
○ Bro to Python
○ Python to Pandas
○ Pandas to Scikit-Learn
● Example: Anomaly Detection
○ Bro DNS and HTTP logs
○ Categorical and Numeric Data
○ Clustering
○ Isolation Forests
Anomaly Detection
Popular Mental Images
DNS/HTTP → I-Forests → Anomalous
Challenges:
● Streaming Data
● Data Volume
● Categorical and Numerical Types
● Efficient DataFrame/Matrix conversions
Output:
● 1-5% of data
● Uncommon (by definition)
● Good Base Camp
* http://github.com/Kitware/bat/blob/master/notebooks/Anomaly_Detection.ipynb
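The "Categorical and Numerical Types" challenge comes down to turning mixed rows into a purely numeric matrix. Below is a minimal standard-library sketch of one common approach, one-hot encoding. The `to_matrix` helper and the sample rows are hypothetical; `query_length` and `answer_length` are illustrative derived features, not necessarily what the notebook computes. With Pandas, `pandas.get_dummies` performs a similar conversion on a DataFrame.

```python
def to_matrix(rows, numeric_fields, categorical_fields):
    """Return (column_names, matrix): numeric fields are copied as floats,
    categorical fields are expanded into one 0/1 column per observed value."""
    # Collect the observed values of each categorical field in a stable order.
    categories = {f: sorted({r[f] for r in rows}) for f in categorical_fields}
    columns = list(numeric_fields) + [
        f"{f}_{c}" for f in categorical_fields for c in categories[f]
    ]
    matrix = []
    for r in rows:
        vec = [float(r[f]) for f in numeric_fields]
        for f in categorical_fields:
            vec.extend(1.0 if r[f] == c else 0.0 for c in categories[f])
        matrix.append(vec)
    return columns, matrix


# Made-up DNS-like rows, for illustration only.
dns_rows = [
    {"query_length": 14, "answer_length": 13, "qtype_name": "A", "proto": "udp"},
    {"query_length": 22, "answer_length": 40, "qtype_name": "AAAA", "proto": "udp"},
    {"query_length": 63, "answer_length": 250, "qtype_name": "TXT", "proto": "tcp"},
]

columns, matrix = to_matrix(
    dns_rows,
    numeric_fields=["query_length", "answer_length"],
    categorical_fields=["qtype_name", "proto"],
)
print(columns)
for vec in matrix:
    print(vec)
```

One-hot columns grow with the number of distinct values, so high-cardinality fields such as raw query strings usually need to be summarized (length, entropy, and so on) rather than encoded directly.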
Isolation Forests: Anomaly Detection
https://github.com/Kitware/bat/blob/master/notebooks/Anomaly_Detection.ipynb
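For intuition about why Isolation Forests flag uncommon rows, here is a small standard-library sketch of the core idea: points that random splits isolate quickly get short average path lengths and therefore high anomaly scores. This is a teaching sketch under simplifying assumptions, not the notebook's code; all function names and the synthetic data are made up. In practice scikit-learn's `sklearn.ensemble.IsolationForest` is the tool to use.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Normalizer c(n): average path length of an unsuccessful search in a
    binary search tree built over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return ("leaf", len(rows))
    dim = rng.randrange(len(rows[0]))
    lo = min(r[dim] for r in rows)
    hi = max(r[dim] for r in rows)
    if lo == hi:
        return ("leaf", len(rows))
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(row, tree, depth=0):
    """Depth at which `row` lands, plus a correction for unsplit leaves."""
    if tree[0] == "leaf":
        return depth + average_path_length(tree[1])
    _, dim, split, left, right = tree
    return path_length(row, left if row[dim] < split else right, depth + 1)


def anomaly_scores(rows, n_trees=100, sample_size=256, seed=0):
    """Score each row in (0, 1]; values closer to 1 are more anomalous."""
    if len(rows) < 2:
        raise ValueError("need at least two rows")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(rows, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = average_path_length(sample_size)
    return [2.0 ** (-(sum(path_length(r, t) for t in trees) / n_trees) / norm)
            for r in rows]


# Synthetic data: 40 "normal" points in a tight box plus one far-away point.
data_rng = random.Random(42)
normal = [(data_rng.uniform(10, 30), data_rng.uniform(20, 60)) for _ in range(40)]
outlier = (250.0, 900.0)
rows = normal + [outlier]

scores = anomaly_scores(rows)
most_anomalous = max(range(len(rows)), key=scores.__getitem__)
print(most_anomalous, round(scores[most_anomalous], 3))
```

The far-away point is usually separated by the very first split, so its average path is close to 1 while points inside the cluster need several more splits.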
Anomalous to Interesting
Organization + User Feedback
Example: 10k rows clustered and organized for display to the user *
Anomalous DNS/HTTP → Organization and Clustering → Display and Feedback* → Interesting
Challenges:
● Streaming Data
● Organization and Clustering
● Engaging the Human
● User Interface and Feedback*
Output:
● Fraction of 1%-5%
● Clustered/organized
● Ready for Feedback*
* Feedback will be used in the next phase of the pipeline
* http://github.com/Kitware/bat/blob/master/notebooks/Anomaly_Detection.ipynb
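The "Organization and Clustering" step can be sketched as grouping the anomalous rows so a person reviews a handful of clusters instead of thousands of individual rows. The following is a minimal standard-library k-means with a deterministic farthest-point start, assuming the anomalous rows have already been reduced to numeric feature vectors. The `kmeans` helper, the row ids, and the points are made up for illustration; a library implementation such as scikit-learn's `KMeans` would be the practical choice.

```python
import math


def kmeans(points, k, iterations=20):
    """Tiny k-means: returns one cluster label per point."""
    # Deterministic start: take the first point, then repeatedly add the
    # point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        labels = [
            min(range(k), key=lambda i: math.dist(p, centers[i])) for p in points
        ]
        # Move each center to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Made-up anomalous rows: (row id, feature vector).
anomalous = [
    ("row-17", (60.0, 250.0)),
    ("row-42", (63.0, 245.0)),
    ("row-58", (58.0, 260.0)),
    ("row-03", (5.0, 10.0)),
    ("row-29", (4.0, 12.0)),
    ("row-77", (6.0, 9.0)),
]

labels = kmeans([features for _, features in anomalous], k=2)

# Organize row ids by cluster so each group can be displayed together.
groups = {}
for (row_id, _), label in zip(anomalous, labels):
    groups.setdefault(label, []).append(row_id)

for label in sorted(groups):
    print(label, groups[label])
```

Showing one group at a time gives the user a natural unit to label, which is the kind of feedback the next phase of the pipeline would consume.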
Demo: Anomaly Detection
https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Scikit.ipynb
https://github.com/Kitware/bat/blob/master/notebooks/Anomaly_Detection.ipynb
Demo: Bro to Kafka to Spark
https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Kafka_to_Spark.ipynb
Demo: Bro to Parquet to Spark
https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Parquet_to_Spark.ipynb
Questions/Comments?