ANOMALY DETECTION IN
NETWORK TRAFFIC USING ML
Muhamad Nizam Azmi
Mataram University
WHY ANOMALY DETECTION?
Anomaly Detection Overview
Crucial task: identifying deviations from normal behavior
Applications: fraud detection, industrial maintenance, cybersecurity
Challenges
High-dimensional data
Large-scale distributed systems
Vast amounts of data
Advancements in Machine Learning
Powerful tools for pattern recognition
Effective in complex and high-dimensional
datasets
NEXT: PROJECT
Distinguishing between normal and anomalous
behaviors
Focus: Unsupervised Learning Algorithms
No need for labeled attack data
Ideal for real-world applications with scarce
labeled data
3 TYPES OF ANOMALY
Point anomaly
Contextual anomaly
Collective anomaly
Project focus: Point anomaly detection
COMPARISON OF VARIOUS MACHINE LEARNING ALGORITHMS
Aim: Identify the most effective techniques for different anomaly detection tasks
Provide insights and guidelines for future applications in cybersecurity, industrial monitoring, and beyond
AdaBoost, Naive Bayes, Gradient Boosting,
Logistic Regression, K-Nearest Neighbors (KNN), SVM,
Random Forest, Decision Tree, Neural Network (NN)
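The comparison above can be sketched in scikit-learn. This is a minimal benchmark on synthetic data with default hyperparameters; the project's actual dataset, preprocessing, and settings may differ.

```python
# Sketch: fit and score the nine compared classifiers on synthetic data.
# Hyperparameters are library defaults, assumed for illustration only.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Neural Network": MLPClassifier(max_iter=500, random_state=0),
}

# Test-set accuracy per model, highest first.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} accuracy = {acc:.3f}")
```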
WHY CHOOSE THESE ALGORITHMS?
AdaBoost: chosen for its ability to enhance the performance of simple models, making it effective in scenarios where data may have a lot of noise or complex patterns.
Naive Bayes: selected for its simplicity, speed, and effectiveness in handling large datasets, making it suitable for real-time anomaly detection tasks.
Gradient Boosting: chosen for its high predictive accuracy and ability to handle a variety of data types and distributions, which is crucial for detecting subtle anomalies.
Random Forest: chosen for its high accuracy, ability to handle large datasets with higher dimensionality, and robustness against overfitting.
Logistic Regression: selected for its interpretability and effectiveness in binary classification problems, making it useful for understanding and explaining anomaly detection results.
K-Nearest Neighbors (KNN): chosen for its simplicity and ability to perform well with small to medium-sized datasets, particularly when the data is not linearly separable.
SVM: selected for its robustness in high-dimensional spaces and its effectiveness in cases where the anomaly classes are not linearly separable.
Neural Network (NN): chosen for its flexibility and ability to model complex, non-linear relationships in data, which is essential for accurately detecting anomalies in high-dimensional datasets.
These algorithms collectively provide a robust framework for identifying anomalies, leveraging their individual strengths to enhance the accuracy and reliability of the anomaly detection system.
PREVIOUS RESEARCH
Evaluation of Distributed ML Algorithms for Anomaly Detection
Astekin, M., Zengin, H., and Sözer, H. (2018) compared distributed machine
learning algorithms for system log analysis. They focused on scalability
and efficiency, highlighting the strengths of certain algorithms in handling
large datasets.
Metaheuristics and Machine Learning for Anomaly Detection in Big Data
Cavallaro, C., Cutello, V., Pavone, M., and Zito, F. (2023) reviewed the use of
metaheuristics combined with machine learning for anomaly detection.
Their study showed improved detection accuracy and adaptability to
different datasets.
Industrial Anomaly Detection with Neural Network Architectures
Siegel, B. (2020) compared neural network architectures for detecting
industrial anomalies. The study focused on real-time detection capabilities
and accuracy, discussing implementation challenges and solutions.
Additional related studies are listed in the references.
DATASET INFORMATION
Purpose: The KDD Cup 1999 dataset is used to build a predictive model to distinguish
between "bad" connections (intrusions or attacks) and "good" (normal) connections in
a computer network. It aims to protect the network from unauthorized users, including
potential insiders.
Background: The dataset is based on the 1998 DARPA Intrusion Detection Evaluation
Program, managed by MIT Lincoln Labs. The program's objective was to evaluate
research in intrusion detection using a standard set of data that includes various
intrusions simulated in a military network environment.
Data Collection:
Environment: Simulated a typical U.S. Air Force LAN with multiple simulated attacks.
Duration: Data was collected over nine weeks (seven weeks for training, two weeks for testing).
Data Size:
Training data: 4 gigabytes of compressed binary TCP dump data, resulting in about five million connection
records.
Test data: Around two million connection records.
Connection Records: Each connection is a sequence of TCP packets between a source IP address and a target IP
address, labeled as either normal or a specific type of attack.
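A connection record is one labeled row per TCP session. As a rough sketch, records in this style can be loaded with pandas; the column subset and the two sample rows below are illustrative, not taken from the actual dataset.

```python
# Sketch: loading KDD-style connection records with pandas.
# Only 6 of the dataset's columns are shown; the sample rows are made up.
import io

import pandas as pd

cols = ["duration", "protocol_type", "service", "src_bytes", "dst_bytes", "label"]
raw = io.StringIO(
    "0,tcp,http,215,45076,normal.\n"
    "0,icmp,ecr_i,1032,0,smurf.\n"
)
df = pd.read_csv(raw, names=cols)
df["label"] = df["label"].str.rstrip(".")  # KDD labels carry a trailing dot
print(df)
```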
TYPES OF ATTACKS
DOS: Denial of Service, e.g., SYN flood
R2L: Remote to Local, e.g., guessing passwords
U2R: User to Root, e.g., buffer overflow attacks
Probing: e.g., port scanning
RESULT
ICMP: The most frequent protocol
type with over 250,000
occurrences.
TCP: The second most common
protocol type with over 150,000
occurrences.
UDP: The least frequent protocol
type with fewer than 50,000
occurrences.
RESULT
logged_in flag:
0: Not logged in
1: Successfully logged in
The number of records without a login (0) is significantly higher than those with a successful login (1).
RESULT
Categories:
dos: Denial of Service attacks -
391,458 instances
normal: Normal traffic (no
attack) - 97,278 instances
probe: Surveillance and
probing attacks - 4,107
instances
r2l: Remote to local attacks -
1,126 instances
u2r: User to root attacks - 52
instances
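Counts like these come from mapping each attack label to one of the five categories and tallying. A minimal sketch, assuming a hand-written mapping (only a few of the dataset's labels are shown here):

```python
# Sketch: group KDD attack labels into the five categories above.
# The mapping and sample labels are a small illustrative subset.
import pandas as pd

category_of = {
    "normal": "normal",
    "smurf": "dos", "neptune": "dos",
    "ipsweep": "probe", "portsweep": "probe",
    "guess_passwd": "r2l",
    "buffer_overflow": "u2r",
}
labels = pd.Series(["smurf", "normal", "neptune", "ipsweep", "guess_passwd"])
counts = labels.map(category_of).value_counts()
print(counts)
```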
RESULT
num_root: Removed due to high correlation with
num_compromised (Correlation = 0.9938).
srv_serror_rate: Removed due to high correlation
with serror_rate (Correlation = 0.9984).
srv_rerror_rate: Removed due to high correlation
with rerror_rate (Correlation = 0.9947).
dst_host_srv_serror_rate: Removed due to high
correlation with srv_serror_rate (Correlation =
0.9993).
dst_host_serror_rate: Removed due to high
correlation with rerror_rate (Correlation = 0.9870).
dst_host_rerror_rate: Removed due to high
correlation with srv_rerror_rate (Correlation =
0.9822).
dst_host_srv_rerror_rate: Removed due to high
correlation with rerror_rate (Correlation = 0.9852).
dst_host_same_srv_rate: Removed due to high
correlation with dst_host_srv_count (Correlation =
0.9737).
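The removals above follow a standard recipe: from each pair of features whose absolute correlation exceeds a threshold, drop one. A minimal sketch on synthetic data (the 0.97 threshold matches the correlations reported above; column names are illustrative):

```python
# Sketch: drop one feature from each highly correlated pair (|r| > 0.97).
# Synthetic data; "srv_serror_rate" is built as a near-copy of "serror_rate".
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
df = pd.DataFrame({
    "serror_rate": a,
    "srv_serror_rate": a + rng.normal(scale=0.01, size=500),  # near-duplicate
    "src_bytes": rng.normal(size=500),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.97).any()]
reduced = df.drop(columns=to_drop)
print("dropped:", to_drop)
```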
DECISION TREE RESULT
RANDOM FOREST RESULT
SVM RESULT
KNN RESULT
LOGISTIC REGRESSION RESULT
NEURAL NETWORK RESULT
GRADIENT BOOSTING RESULT
NAIVE BAYES RESULT
ADABOOST RESULT
ROC CURVE & FEATURE IMPORTANCES RESULT
DOS (Class 0): AUC (Area Under the Curve) = 1.00
Normal (Class 1): AUC = 1.00
Probe (Class 2): AUC = 0.99
R2L (Class 3): AUC = 0.97
U2R (Class 4): AUC = 0.82
High Performance:
Classes 0 and 1 have perfect AUC scores of 1.00,
indicating essentially flawless separation of
these classes from the rest.
Class 2 also demonstrates very high
performance with an AUC of 0.99.
Moderate Performance:
Class 3 has an AUC of 0.97, showing strong
performance with minimal false positives.
Class 4 has a lower AUC of 0.82, indicating
room for improvement in distinguishing this
class from others.
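Per-class AUC values like these come from treating each class one-vs-rest. A minimal sketch with scikit-learn, using synthetic 3-class data in place of the 5 KDD categories and a Random Forest as a stand-in model:

```python
# Sketch: per-class (one-vs-rest) ROC AUC. Synthetic data, illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = make_classification(n_samples=1000, n_classes=3,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)                      # shape (n, 3)
y_bin = label_binarize(y_te, classes=clf.classes_)   # one 0/1 column per class

aucs = {}
for i, cls in enumerate(clf.classes_):
    aucs[cls] = roc_auc_score(y_bin[:, i], proba[:, i])
    print(f"class {cls}: AUC = {aucs[cls]:.2f}")
```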
ROC CURVE & FEATURE IMPORTANCES RESULT
Dominant Feature: The srv_count feature
overwhelmingly dominates the feature
importance, indicating it has a critical
impact on the model's performance.
Other Features: Although other features
contribute less, they still hold importance for
the model, affecting specific aspects of its
predictions.
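Importance scores like these can be read directly from a fitted tree ensemble. A minimal sketch (synthetic data; the feature names are illustrative, and the target is built so the first feature dominates, mimicking the srv_count pattern above):

```python
# Sketch: rank features by importance from a fitted Random Forest.
# Synthetic data; "srv_count" is constructed to be the decisive feature.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)          # only the first feature matters
names = ["srv_count", "src_bytes", "duration"]

clf = RandomForestClassifier(random_state=0).fit(X, y)
ranked = sorted(zip(names, clf.feature_importances_), key=lambda t: -t[1])
for name, imp in ranked:
    print(f"{name:10s} {imp:.3f}")
```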
CONCLUSION
Effectiveness of Machine Learning Algorithms:
This project successfully demonstrated that various machine learning
algorithms such as AdaBoost, Naive Bayes, Gradient Boosting, Logistic
Regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM),
Random Forest, Decision Tree, and Neural Networks (NN) are capable of
detecting anomalies in large and complex datasets.
Recommendations for Further Development:
For future research, it is recommended to apply advanced data
augmentation techniques and feature engineering to further enhance
model performance. Additionally, combining multiple algorithms or using
ensemble approaches can help improve the accuracy and robustness of
anomaly detection.
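The ensemble direction suggested above can be sketched with scikit-learn's soft-voting combiner; this is one possible combination of three of the compared classifiers on synthetic data, not the project's implementation.

```python
# Sketch: soft-voting ensemble of three of the compared classifiers.
# Synthetic data and default hyperparameters, for illustration only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=800, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average the members' predicted probabilities
)
acc = ensemble.fit(X_tr, y_tr).score(X_te, y_te)
print(f"ensemble accuracy = {acc:.3f}")
```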
REFERENCES
Astekin, M., Zengin, H., & Sözer, H. (2018). Evaluation of Distributed Machine Learning Algorithms for Anomaly Detection from
Large-Scale System Logs: A Case Study. Proceedings of the IEEE International Conference on Big Data, 862-1967. Get In Touch
With Us
Cavallaro, C., Cutello, V., Pavone, M., & Zito, F. (2023). Discovering anomalies in big data: a review focused on the application
of metaheuristics and machine learning techniques. Frontiers in Big Data, 6. Get In Touch With Us
Shabat, G., Segev, D., & Averbuch, A. (2017). Uncovering Unknown Unknowns in Financial Services Big Data by Unsupervised
Methodologies: Present and Future trends. Proceedings of the Machine Learning Research, 71, 8-19. Get In Touch With Us
Siegel, B. (2020). Industrial Anomaly Detection: A Comparison of Unsupervised Neural Network Architectures. IEEE Sensors
Journal, 4(8), 1-4. Get In Touch With Us
Zoppi, T., Ceccarelli, A., & Bondavalli, A. (2020). Into the Unknown: Unsupervised Machine Learning Algorithms for Anomaly-
Based Intrusion Detection. Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN),
50200, 44. Get In Touch With Us
THANK YOU FOR
YOUR ATTENTION