Proactive Threat Hunting with ML Techniques
Article
Proactive Threat Hunting in Critical Infrastructure Protection
through Hybrid Machine Learning Algorithm Application
Ali Shan 1 and Seunghwan Myeong 2, *
1 Center of Security Convergence & eGovernance, Inha University, Incheon 22212, Republic of Korea;
kralishan@[Link]
2 Department of Public Administration, Inha University, Incheon 22212, Republic of Korea
* Correspondence: shmyeong@[Link]
Abstract: Cyber-security challenges are growing globally and are specifically targeting critical
infrastructure. Conventional countermeasure practices are insufficient to provide proactive threat
hunting. In this study, random forest (RF), support vector machine (SVM), multi-layer perceptron
(MLP), AdaBoost, and hybrid models were applied for proactive threat hunting. By automating
detection, the hybrid machine learning-based method improves threat hunting and frees up time
to concentrate on high-risk warnings. These models are implemented on approach devices, access,
and principal servers. The efficacy of several models, including hybrid approaches, is assessed. The
findings of these studies are that the AdaBoost model provides the highest efficiency, with a 0.98 ROC
area and 95.7% accuracy, detecting 146 threats with 29 false positives. Similarly, the random forest
model achieved a 0.98 area under the ROC curve and a 95% overall accuracy, accurately identifying
132 threats and reducing false positives to 31. The hybrid model exhibited promise with a 0.89 ROC
area and 94.9% accuracy, though it requires further refinement to lower its false positive rate. This
research emphasizes the role of machine learning in improving cyber-security, particularly for critical
infrastructure. Advanced ML techniques enhance threat detection and response times, and their
continuous learning ability ensures adaptability to new threats.
proactive approach, used to analyze large datasets in real-time and spot unusual patterns
and behaviors [10]. Through innovations in these technologies, cyber-defense gains a proactive
component, replacing reactive tactics that find it difficult to keep up with the constantly
shifting threat landscape [11]. To combat diverse security breaches, a variety of research
initiatives have been performed in distinct cyber-sectors, each with unique features and
characteristics. To facilitate infrastructure security, this research attempts to implement a
proactive strategy using machine learning. Assessments were conducted on the effectiveness
of several sophisticated machine learning models, including random forest, support
vector machine, hybrid machine learning, AdaBoost, and multi-layer perceptron models.
To find the optimal model for threat hunting, comparisons between the various models were
also made. Furthermore, these models continuously learn from new incoming threats,
improving model accuracy and performance over time.
Research Gap
It is evident from previous studies that a significant amount of research has been conducted
in the field of cyber-security. Methodological mapping investigations of SLRs
have also been carried out. However, most of the mapping studies that are now available
are about cyber-attack mitigation through different mathematical techniques. There is no
systematic mapping research that compiles information on proactive cyber-security attack
measurements through hybrid machine learning. This mapping project is being carried out
to close the gap by giving researchers a general understanding of the current cyber-security
vulnerabilities and methods for detection and mitigation. To fill these research gaps, this
study focused on the following research questions:
RQ1: How well do the machine learning models handle real-time data from critical
infrastructure sources?
RQ2: How successful are the individual machine learning algorithms (RF, SVM, MLP,
AdaBoost) in identifying and mitigating threats in CI protection?
RQ3: How accurate and effective is the hybrid ML model at detecting threats compared
to individual models?
RQ4: How do models adapt to new and evolving threats when processing real-time
data for threat hunting?
RQ5: Which model is found to be the best optimized in terms of testing and valida-
tion strategies?
These research questions aim to critically explore the multifaceted aspects of imple-
menting machine learning models as a proactive approach for threat hunting in critical
infrastructure protection. The main contributions of this study are as follows:
• Comprehensive analysis of the role of various techniques including proactive, mathe-
matical, machine learning, and hybrid strategies for threat detection.
• Development of machine learning models (RF, SVM, MLP, AdaBoost, and hybrid) to
increase the attack detection accuracy and robustness in critical infrastructure.
• Comparison of the effectiveness of all models using ROC, precision, recall, accuracy,
F1-score, and learning curves.
• Development of the most optimized models that deal with real-world scenarios to
detect cyber-attacks with fewer false positives.
The rest of the paper includes Section 2, which comprises a brief literature review
regarding cyber-security, mathematics, machine learning, and hybrid solutions as proactive
approaches for intrusion detection. Section 3 describes the general methodology considered
in this work. Section 4 includes the results, discussion, and comparative analysis, followed
by concluding remarks.
2. Literature Review
2.1. Cyber-Security
“Cyber-security defined as the protection against unwanted attacks, harm, or transfor-
mation of data in a system and also the safety of systems themselves” [12]. It is concerned
with the security and privacy of digital assets, including networks, computers, and data that
Sensors 2024, 24, 4888 3 of 24
are processed, stored, and transferred via Internet-based information systems, according
to ISACA. The International Telecommunication Union (ITU) defines cyber-security as the set
of methods, guidelines, protocols, best practices, and procedures used to safeguard users’
online assets and organizations [13]. Cyber-security, according to the Merriam-Webster
definition, is the defense of computer systems against intrusions and illegal access [14,15].
Cyber-security comprises the techniques and equipment used to defend computer net-
works and devices from assaults and illegal access over the Internet. Cyber-security is the
defense against unauthorized access to an organization’s non-physical and physical com-
ponents. Diverse definitions among scholars indicate how they define cyber-security [16].
The current definitions concentrate on several facets of cyber-security. Several definitions,
for instance, emphasize privacy and protection, while others emphasize the necessity of
establishing guidelines and procedures for availability, confidentiality, and information
integrity. Cyber-security may be viewed as a defense against unwanted access to the assets
of people and organizations. The significance of the cyber-ecosystem and its preservation
is further emphasized by these concepts [17].
• SQL injection attack: In this attack, an input string is introduced through the
application to alter or manipulate an SQL query to the attacker’s benefit. The attack
harms the database in several ways, including sensitive data exposure and
unauthorized access and modification [29]. It is dangerous because it can
compromise functionality and confidentiality through data loss or misuse of the data
by unauthorized parties. Moreover, this type of attack can also execute commands at
the system level, which prevents authorized users from gaining access to the necessary
data [30].
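The mechanism described in the SQL injection item can be illustrated with a small, self-contained sketch. It uses Python's standard sqlite3 module with an in-memory database; the table, column names, and payload string are hypothetical and chosen only for illustration. The unsafe lookup concatenates user input into the query text, while the safe lookup passes it as a bound parameter.

```python
import sqlite3


def build_db():
    """Create a throwaway in-memory database with two hypothetical user rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_unsafe(conn, name):
    # Vulnerable: the input string becomes part of the SQL text itself.
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def lookup_safe(conn, name):
    # Parameterized: the input is bound as a value and never parsed as SQL.
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


conn = build_db()
payload = "nobody' OR '1'='1"  # hypothetical attacker-supplied input

leaked = lookup_unsafe(conn, payload)   # the injected OR clause matches every row
blocked = lookup_safe(conn, payload)    # no user has this literal name

print("unsafe lookup returned:", sorted(leaked))
print("safe lookup returned:", blocked)
```

With the concatenated query, the injected condition turns the WHERE clause into one that is true for every row, which is the data-exposure effect described above; the parameterized query treats the same input as an ordinary string.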
Critical infrastructure includes both physical and cyber-systems that are necessary
for a society’s basic functions and security [42]. These systems include those related to
electricity, water, transportation, telecommunications, and healthcare. Vulnerabilities in the
energy industry result from possible assaults on infrastructure used for the production and
distribution of power [43]. Cyber-attacks targeting water treatment controls and system
pollution are two problems that might affect water infrastructure [44]. Physical structural
disturbances or cyber-attacks targeting operating systems can have a significant impact
on transportation networks, encompassing seaports and airports. Infrastructure related
to telecommunications, which is necessary for emergency response and communication,
is vulnerable to both physical damage and cyber-attacks. Cyber-attacks that aim to com-
promise sensitive data and cause service interruptions are especially dangerous for the
healthcare industry, the vitality of this infrastructure, and its growing dependence [45].
network anomalies in ambiguous or incomplete data. These hybrid models enable cyber-
security systems to process vast and varied datasets, recognize complex patterns, adapt
to new threats, and enforce security protocols efficiently, resulting in more robust and
intelligent cyber-security solutions [61]. Many businesses depend on cyber-security
experts known as threat hunters. These security experts defend against all types
of cyber-attacks in a timely manner, even zero-day attacks, using real-time data [62]. To
improve business security, most organizations base their systems on artificial intelligence.
Although different types of machine learning models are used for cyber-security, not all of
these models are used for proactive techniques based on real-time data [63].
3. Methodology
There are various phases of the methodology for distributed and scalable machine
learning-based systems that are used for proactive threat hunting in critical infrastructure.
These phases of the methodology include data collection, architecture, data pre-processing,
selection and training of the machine learning model, model validation, and performance
evaluation of the models, as given in detail below. Every phase of the methodology is
designed for unique challenges and critical infrastructure to evaluate the real-time threats.
therefore, these were also detected and removed, which improved the accuracy of the
models. The isolation forest model was utilized to detect and further handle the outliers
in the dataset. Data encryption was used to ensure the security of the data and make it
easier for various ML models to work together while protecting sensitive information from
online threats.
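The outlier-handling step can be sketched as follows. This is a minimal illustration that assumes scikit-learn's IsolationForest and synthetic two-feature data; the contamination rate of 0.05, the planted points, and the helper name remove_outliers are assumptions made for the example and are not values or code reported in this study.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def remove_outliers(X, contamination=0.05, random_state=0):
    """Return (inlier rows, boolean outlier mask) using an isolation forest.

    contamination is an assumed expected share of outliers, not a value
    taken from the paper.
    """
    forest = IsolationForest(
        n_estimators=100,
        contamination=contamination,
        random_state=random_state,
    )
    labels = forest.fit_predict(X)  # -1 marks outliers, 1 marks inliers
    outlier_mask = labels == -1
    return X[~outlier_mask], outlier_mask


# Synthetic stand-in data: 200 ordinary points plus 5 planted extreme points.
rng = np.random.default_rng(0)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array(
    [[10.0, 10.0], [-10.0, 10.0], [10.0, -10.0], [-10.0, -10.0], [0.0, 15.0]]
)
X = np.vstack([normal_points, planted])

X_clean, outlier_mask = remove_outliers(X)
print("rows flagged as outliers:", int(outlier_mask.sum()))
print("rows kept:", X_clean.shape[0])
```

Whether flagged rows should be dropped, capped, or investigated separately depends on the dataset; the sketch simply drops them.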
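\[
\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{1}
\]
\[
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \tag{2}
\]
\[
\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{3}
\]
\[
\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}
\]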
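As a minimal sketch, the four evaluation metrics can be computed directly from confusion-matrix counts. The function name and the example counts below are hypothetical and chosen for easy arithmetic; the F1 score is computed as the harmonic mean of precision and recall, 2PR/(P + R).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, accuracy, recall, and F1 for the positive (threat) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "precision": precision,
        "accuracy": accuracy,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 80 threats caught, 20 false alarms, 20 missed threats,
# and 880 normal records correctly passed.
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```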
Moreover, the pair-plot of the dataset (Figure 2) provided the visual representation of
the relationship between pairs of features in the dataset. Plots of this kind are especially
helpful in determining probable patterns, correlations, and distributions between various
variables. The figure indicates that the card_present_flag plot shows no significant link
with other features. When the card is present, the transaction appears to be more dispersed
throughout a larger range of amounts, which are rather concentrated at smaller amounts.
The distribution of the balance variable is right-skewed, indicating that while most people
have smaller balances, a small percentage have noticeably higher amounts. The relationship
between balance and other variables is more complex in the scatter plot. The distribution
of the age variable is slightly skewed to the left, showing a higher percentage of younger
people, whereas the amount variable is heavily skewed to the right, indicating a few
high-value transactions. These skewed distributions are clear indications of the presence
of outliers. Potential outliers are highlighted by the scatter plots, particularly in terms
of amount and balance. These outliers can require attention during data pre-processing,
or they might be areas of interest for additional investigation. Plots with concentrated
points in specific places may indicate common trends or clusters that could be helpful for
analyzing customer behavior or developing prediction models.
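The link between a skewed distribution and a few extreme values can be checked numerically. The sketch below computes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second); a positive value indicates a long right tail. The transaction amounts are hypothetical and are not taken from the dataset used in this study.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # all values identical, no asymmetry to measure
    return m3 / m2 ** 1.5


# Hypothetical amounts: mostly small transactions plus one very large one.
amounts = [5, 8, 10, 12, 15, 18, 20, 25, 30, 900]
symmetric = [1, 2, 3, 4, 5]

print("skewness of amounts:", round(skewness(amounts), 2))
print("skewness of symmetric sample:", skewness(symmetric))
```

A single high-value transaction is enough to push the coefficient well above zero, which is why heavily right-skewed variables such as amount are natural candidates for outlier inspection.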
Figure 2. Pair-plots of the dataset.
Figure 3. Correlation matrix of the dataset.
Figure 4 indicates the number of outliers detected over time. The number of outliers
found is considerable in the beginning and peaks at about 20. This suggests that the data
may have experienced some initial instability or noise. Over time, there are noticeable
swings in the number of outliers, with both higher and lower outlier counts. This variability
shows that the nature or quality of the data may vary over time. The number
of outliers jumps at a few significant points in the data, including the beginning and the
end of the period. These peaks might point to events, irregularities, or notable modifications
in the underlying process that produced the data during data collection.
Figure 4. Detection of outliers in the dataset.
4.3. Threat Detection by ML Models
The next step is the threat detection by utilizing ML classifiers. Various ML models
were applied to identify threats or anomalies in the real-time dataset of critical infrastructure.
These include RF, SVM, MLP, AdaBoost, and hybrid models, which were assessed for threat
hunting, and their performance was compared. These reconstruction-based models provide
greater sensitivity, enabling more threat detection. The performance of these models was
analyzed by utilizing the confusion matrix, ROC curve, and precision–recall curve. The
details of all these models are as follows.
Figure 5. Confusion matrix of random forest model.
The ROC and precision–recall curves of the RF model are shown in Figure 6 below.
The ROC curve plots the true positive rate against the false positive rate across various
thresholds. The area under the ROC curve determines the degree of the model’s quality and
discriminates whether the model satisfies the specific conditions or not. The area greater
than 0.98 indicates excellent performance of RF in terms of threat identification. However,
the F1-score is about 0.7, which indicates accurate detection of threats at first, but precision
is lost by identifying more false positives. The model is quite good at differentiating threats
and normal data, as evidenced by its high AUC value. The F1-score exhibits reasonable
balance but precision becomes compromised as recall rises. The detailed classification
report of the evaluation metrics is given in Table 4. This table indicates greater precision,
of 0.960, for normal data identification and 0.809 for anomaly detection. Moreover, high
overall accuracy of 0.950 is observed for this model. Overall, these measures collectively
suggest that the RF model performed well for threat hunting, with a robust ability to
minimize false positives.
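How an ROC curve and its area are obtained can be sketched in a few lines. The example below sweeps the decision threshold over every distinct score, records the false positive rate and true positive rate at each step, and integrates with the trapezoidal rule. The labels and scores are hypothetical, and the function names are illustrative only.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists by sweeping the threshold from high to low."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    fpr, tpr = [0.0], [0.0]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr


def area_under_curve(fpr, tpr):
    """Trapezoidal area under the (fpr, tpr) curve."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


# Hypothetical labels (1 = threat) and model scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9]

fpr, tpr = roc_points(y_true, scores)
auc = area_under_curve(fpr, tpr)
print("AUC:", auc)
```

An area close to 1 means that threats almost always receive higher scores than normal records; in this toy data one normal record outranks one threat, so the area falls slightly below 1.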
Figure 6. ROC and precision–recall curve of random forest model.
Table 4. Evaluation metrics for random forest models.
               Precision   Recall   F1-Score   Support
0              0.960       0.985    0.972      2100
1              0.809       0.611    0.696      216
Accuracy       0.950       0.950    0.950      0.950
Macro avg      0.885       0.798    0.834      2316
Weighted avg   0.946       0.950    0.947      2316
4.3.2. Support Vector Machine (SVM) Model
The model’s accuracy and reliability were also assessed using a confusion matrix
(Figure 7). A large number of normal instances (2092) were correctly identified by the
SVM model, indicating good performance in identifying normal instances. Very few
errors (8) were observed as false positives. However, the model failed to identify a
considerable number of actual threats (188), which is a critical concern for threat
detection applications. The model successfully captured only a small number
of threats (28), highlighting the need for improvement.
Figure 7. Confusion matrix of support vector machine model.
The efficiency of the SVM model was also analyzed using ROC and precision–recall
curves (Figure 8). The area under the curve (ROC) of the SVM model is 0.82, suggesting that
the model can distinguish between normal and anomalous cases with reasonable accuracy.
The ROC curve’s shape indicates that, across a range of thresholds, the model maintains
a high true positive rate while limiting the false positive rate. The precision–recall curve
plot is also essential where detecting the anomalies is crucial. The maximum F1-score
achieved by the model is 0.22. This value reflects the trade-off between precision and recall,
emphasizing the difficulty of detecting anomalies. The model achieves excellent precision
at first, but as recall increases, it decreases. This shows that although the model may
correctly detect some abnormalities, a higher proportion of true positives is accompanied
by a higher proportion of false positives. The curve’s downward trend suggests that as
the model tried to capture more true threats (high recall), the proportion of false positives
also increased, reducing precision. Table 5 indicates the efficiency of the model in terms of
evaluation metrics. This model exhibited high precision (0.917) for normal data detection,
and low precision (0.77) for abnormal data, with 0.915 accuracy.
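The precision–recall trade-off and the notion of a maximum F1-score along the curve can be illustrated with a threshold sweep. The sketch below uses hypothetical labels and scores (not the SVM outputs from this study): lowering the threshold raises recall, usually at the cost of precision, and the F1-score peaks somewhere in between.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall, and F1 when scores >= threshold are flagged as threats."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


def best_f1_threshold(y_true, scores):
    """Return the candidate threshold with the highest F1 and its metrics."""
    candidates = sorted(set(scores), reverse=True)
    best = max(candidates, key=lambda t: precision_recall_f1(y_true, scores, t)[2])
    return best, precision_recall_f1(y_true, scores, best)


# Hypothetical labels (1 = threat) and anomaly scores.
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.80, 0.60, 0.55, 0.40, 0.35, 0.30, 0.20, 0.10]

best_threshold, (best_p, best_r, best_f1) = best_f1_threshold(y_true, scores)
strict = precision_recall_f1(y_true, scores, 0.95)
lenient = precision_recall_f1(y_true, scores, 0.10)

print("best threshold:", best_threshold, "F1:", round(best_f1, 3))
print("strict threshold precision/recall:", strict[0], strict[1])
print("lenient threshold precision/recall:", lenient[0], lenient[1])
```

A low maximum F1-score, such as the 0.22 reported for the SVM model, means that no threshold along the curve achieves both acceptable precision and acceptable recall at the same time.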
Figure 8. ROC and precision–recall curve of support vector machine model.
Figure 9. Confusion matrix of multi-layer perceptron model.
Figure 10. ROC and precision–recall curves of multi-layer perceptron model.
Table 6. Evaluation metrics for multi-layer perceptron model.
               Precision   Recall   F1-Score   Support
0              0.963       0.558    0.706      2100
1              0.155       0.791    0.260      216
Accuracy       0.579       0.579    0.579      0.579
Macro avg      0.559       0.674    0.483      2316
Weighted avg   0.887       0.579    0.665      2316
4.3.4. AdaBoost Model
The confusion matrix of AdaBoost model is shown in Figure 11. The model’s excellent
recognition of non-anomalous data is demonstrated by its exact identification of 2071 normal
cases (class 0). It is important for practical anomaly detection systems to minimize
disruptions caused by false alerts, and the low frequency of false positives (29) shows
that the model does not frequently raise unnecessary alarms. The model properly found
146 real anomalies. This suggests a reasonable sensitivity level and shows that the model
can identify true abnormalities in the dataset. Seventy real abnormalities (70) were missed
by the model, which mistakenly classified them as typical occurrences. Even though this
number is small, it is nevertheless noteworthy since false negatives in anomaly detection
can have detrimental effects.
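The confusion-matrix counts quoted above (2071 true negatives, 29 false positives, 70 false negatives, 146 true positives) are enough to rebuild a per-class report. The sketch below does so in plain Python. The averaging conventions (macro as the unweighted mean of the two classes, weighted as the support-weighted mean) follow common practice and are an assumption about how the reported table was produced; the function name is illustrative.

```python
def per_class_report(tn, fp, fn, tp):
    """Per-class precision/recall/F1 plus macro and weighted averages.

    Assumes each class is predicted at least once, so no denominator is zero.
    """
    def prf(correct, predicted_total, actual_total):
        precision = correct / predicted_total
        recall = correct / actual_total
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    support0, support1 = tn + fp, fn + tp
    rows = {
        "0": prf(tn, tn + fn, support0),  # normal class
        "1": prf(tp, tp + fp, support1),  # threat class
    }
    total = support0 + support1
    macro = tuple((rows["0"][i] + rows["1"][i]) / 2 for i in range(3))
    weighted = tuple(
        (rows["0"][i] * support0 + rows["1"][i] * support1) / total
        for i in range(3)
    )
    accuracy = (tn + tp) / total
    return rows, macro, weighted, accuracy


# Counts taken from the AdaBoost confusion matrix described in the text.
rows, macro, weighted, accuracy = per_class_report(tn=2071, fp=29, fn=70, tp=146)

for label, (p, r, f1) in rows.items():
    print(f"class {label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print("macro avg:", tuple(round(v, 3) for v in macro))
print("weighted avg:", tuple(round(v, 3) for v in weighted))
print("accuracy:", round(accuracy, 3))
```

Under these assumptions the recomputed values agree with the AdaBoost figures reported in this section to within rounding in the third decimal place.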
Figure 11. Confusion matrix of AdaBoost model.
The area under the ROC curve of 0.98 signifies excellent performance, as shown in
Figure 12. The curve’s proximity to the top left corner denotes an elevated true positive
rate and a low false positive rate, highlighting the model’s robustness in recognizing
anomalies and reducing false alarms. Similarly, the model achieves the maximum F1-score
of 0.75, as indicated by the precision–recall curve. The pattern of the curve shows that
the model performs well up until recall increases to a point where precision begins to
decline more sharply. The overall metrics calculated from these curves are given in Table 7,
exhibiting the highest precision values of 0.96 and 0.83 for 0 and 1 classes, respectively.
Figure 12. ROC and precision–recall curves of AdaBoost model.
Table 7. Evaluation metrics for AdaBoost models.
               Precision   Recall   F1-Score   Support
0              0.967       0.986    0.976      2100
1              0.834       0.675    0.746      216
Accuracy       0.957       0.957    0.957      0.957
Macro avg      0.900       0.831    0.861      2316
Weighted avg   0.954       0.957    0.955      2316
Figure 13. Confusion matrix of hybrid model.
Moreover, when evaluation curves of hybrid model were evaluated (Figure 14), the
area for ROC curves was found to be 0.89. This signifies the strong performance of
this model, with a higher true positive rate and fewer false positive [Link]. This is also
confirmed by the sharp decrease in precision in the precision–recall curve, indicating an
F1-score of 0.70. The values of precision and accuracy for cyber-attack detection were
calculated to be 0.786 and 0.949, respectively, as shown in Table 8. These indicators show
that the hybrid model can identify cyber-attacks with a high degree of accuracy; nevertheless,
further work needs to be done to improve overall precision and minimize false positives.
Figure 14. ROC and precision–recall curves of hybrid model.
[Figure: bar chart of the area under the ROC curve for the compared models; recoverable bar labels: 0.98, 0.98, 0.89, 0.82, 0.76.]
The comparison of the precision, recall, and F1-score for anomalous (Figure 17a) and
normal data (Figure 17b) detection, along with the accuracy of each model, is given in
the form of bar graphs shown in Figure 17 below. This also indicates that the AdaBoost
algorithm depicts highest precision and accuracy for anomaly detection. Moreover,
the value for recall and accuracy is below 0.8 but still greater than that of all others.
This exhibits the appreciable effectiveness of the AdaBoost model in differentiating
the normal and anomalous data from a real-time dataset in cyber-security. This efficiency
trend is followed by RF and hybrid models, which exhibit almost equal efficiency to that
of the AdaBoost model in terms of these evaluation metrics. Moreover, the SVM model
shows lower recall and F1-score for threat detection. Furthermore, the worst performance
is demonstrated by the MLP model, with small precision and F1-score for cyber-threat
detection in critical infrastructure. Figure 17b indicates evaluation metrics for normal data
detection. This figure clearly shows that all models indicate reasonably high performance
in terms of normal data detection without any errors, while the MLP algorithm also
indicated small recall and F1-score for normalized data detection in the real-time dataset.
By considering the learning curves, evaluation metrics, and ROC curve area, AdaBoost
outperformed all other models and was found to be the most optimized for threat hunting.
By analyzing all these metrics and ROC curve patterns, we concluded the following order
of model performances in terms of cyber-threat detection: AdaBoost > RF > hybrid > SVM > MLP.
Figure 17. Comparison of evaluation metrics of (a) anomaly detection and (b) normal data detection
for all models.
5. Conclusions
Early anomaly detection in software-defined networking has an extensive impact
on the network’s operational efficiency. The latest developments in ML aid in effective
anomaly identification and improve service [Link]. Here, we investigated the use of RF,
SVM, MLP, AdaBoost and hybrid machine learning models in tandem for identifying
anomalies and offer a thorough overview of network architecture. Firstly, we talk about
the limits of the current methods and the significance of identifying anomalies in contemporary
networks. We outline their fundamental idea, possible uses, advantages, and drawbacks.
Additionally, we included a thorough synopsis of those [Link].
• This study emphasizes the crucial role of ML in bolstering cyber-security for critical infrastructure.
• Random forest and AdaBoost models displayed exceptional performance, each with a 0.98 ROC area and overall accuracies of 95% and 95.7%, respectively.
• The hybrid model showed potential, with a 0.89 ROC area and 94.9% accuracy, although it requires improvement to lower false positives.
• ML models’ continuous learning capabilities ensure that they can adapt to new and emerging threats, enhancing the accuracy and speed of threat detection.
• Our work sheds light on how to build optimized autonomous models that can protect the system from sophisticated cyber-attacks. Future studies should try to replicate this study across a range of operational contexts and data variations in more general scenarios. We used learning curves to assess the model feasibility in terms of threat detection. Gaining a knowledge of these curves is crucial for recognizing bias–variance trade-offs, possible overfitting or underfitting problems, and learning behaviors of the models.
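As an illustration of how learning curves can point to overfitting or underfitting, the following is a minimal sketch and not the procedure used in this study. It assumes accuracy values recorded at increasing training-set sizes; the function name `diagnose_learning_curve`, the thresholds `gap_tol` and `low_score`, and all sample scores are hypothetical.

```python
from statistics import mean


def diagnose_learning_curve(train_scores, val_scores, tail=3,
                            gap_tol=0.05, low_score=0.80):
    """Label a learning curve from the average of its last `tail` points.

    train_scores / val_scores: accuracy at increasing training-set sizes.
    gap_tol and low_score are hypothetical thresholds, not values
    taken from the study.
    """
    if len(train_scores) != len(val_scores) or not train_scores:
        raise ValueError("score lists must be non-empty and of equal length")

    train_end = mean(train_scores[-tail:])
    val_end = mean(val_scores[-tail:])
    gap = train_end - val_end

    # Both curves plateau at a low score: the model is too simple (bias).
    if train_end < low_score and val_end < low_score:
        return "underfitting (high bias)"
    # Training score stays well above validation score: variance problem.
    if gap > gap_tol:
        return "overfitting (high variance)"
    return "reasonable fit"


# Made-up curves for illustration only.
overfit = diagnose_learning_curve([1.00, 0.99, 0.99, 0.99],
                                  [0.70, 0.78, 0.82, 0.84])
underfit = diagnose_learning_curve([0.72, 0.71, 0.70, 0.70],
                                   [0.65, 0.67, 0.68, 0.69])
good = diagnose_learning_curve([0.99, 0.97, 0.96, 0.96],
                               [0.88, 0.92, 0.94, 0.95])
print(overfit, underfit, good, sep="\n")
```

In practice, the thresholds would need to be chosen per dataset, and the curves themselves would come from repeated cross-validated training runs rather than single scores.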
Author Contributions: Conceptualization, A.S.; Methodology, A.S.; Validation, A.S. and S.M.; Formal analysis, A.S.; Investigation, A.S. and S.M.; Resources, S.M.; Data curation, A.S.; Writing – original draft, A.S.; Writing – review & editing, A.S.; Visualization, A.S.; Supervision, S.M.; Project administration, S.M.; Funding acquisition, S.M. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2022S1A5C2A03093690).
Data Availability Statement: Data is available on request.
Conflicts of Interest: The authors declare no conflict of interest.
Sensors 2024, 24, 4888 22 of 24
References
1. Prokopowicz, D.; Goł˛ebiowska, A. Increase in the Internetization of economic processes, economic, pandemic and climate crisis
as well as cybersecurity as key challenges and philosophical paradigms for the development of the 21st century civilization. J.
Mod. Sci. 2021, 47, 307–344. [CrossRef]
2. Ruposky, T.J. The Exponential Rise of Cybercrime. Univ. Cent. Fla. Dep. Leg. Stud. Law J. 2022, 5, 137.
3. Jain, A.K.; Sahoo, S.R.; Kaubiyal, J. Online social networks security and privacy: Comprehensive review and analysis. Complex
Intell. Syst. 2021, 7, 2157–2177. [CrossRef]
4. Aslan, Ö.; Aktuğ, S.S.; Ozkan-Okay, M.; Yilmaz, A.A.; Akin, E. A comprehensive review of cyber security vulnerabilities, threats,
attacks, and solutions. Electronics 2023, 12, 1333.
5. Khadidos, A.O.; AlKubaisy, Z.M.; Khadidos, A.O.; Alyoubi, K.H.; Alshareef, A.M.; Ragab, M. Binary Hunter–Prey Optimization
with Machine Learning—Based Cybersecurity Solution on Internet of Things Environment. Sensors 2023, 23, 7207. [CrossRef]
6. Nassar, A.; Kamal, M. Machine Learning and Big Data analytics for Cybersecurity Threat Detection: A Holistic review of
techniques and case studies. J. Artif. Intell. Mach. Learn. Manag. 2021, 5, 51–63.
7. Nour, B.; Pourzandi, M.; Debbabi, M. A Survey on Threat Hunting in Enterprise Networks. IEEE Commun. Surv. Tutor. 2023, 25,
2299–2324. [CrossRef]
8. Khordadpour, P. Toward Efficient Protecting Cyber-Physical Systems with Cyber Threat Hunting and Intelligence. TechRxiv 2023.
[CrossRef]
9. Rabbani, M.; Wang, Y.; Khoshkangini, R.; Jelodar, H.; Zhao, R.; Bagheri Baba Ahmadi, S.; Ayobi, S. A review on machine learning
approaches for network malicious behavior detection in emerging technologies. Entropy 2021, 23, 529. [CrossRef] [PubMed]
10. Bhardwaj, A.; Kaushik, K.; Alomari, A.; Alsirhani, A.; Alshahrani, M.M.; Bharany, S. Bth: Behavior-based structured threat
hunting framework to analyze and detect advanced adversaries. Electronics 2022, 11, 2992. [CrossRef]
11. Choo, K.-K.R.; Conti, M.; Dehghantanha, A. Special Issue on Big Data Applications in Cyber Security and Threat Intelligence–Part
1. IEEE Trans. Big Data 2019, 5, 279–281. [CrossRef]
12. Kuhl, M.E.; Sudit, M.; Kistner, J.; Costantini, K. Cyber attack modeling and simulation for network security analysis. In
Proceedings of the 2007 Winter Simulation Conference, Washington, DC, USA, 9–12 December 2007; IEEE: Piscataway, NJ, USA,
2007.
13. Farraj, A.; Hammad, E.; Kundur, D. Impact of Cyber Attacks on Data Integrity in Transient Stability Control. In Proceedings of
the 2nd Workshop on Cyber-Physical Security and Resilience in Smart Grids, Pittsburgh, PA, USA, 21 April 2017.
14. Cheng, M.; Crow, M.; Erbacher, R.F. Vulnerability analysis of a smart grid with monitoring and control system. In Proceedings of
the Eighth Annual Cyber Security and Information Intelligence Research Workshop, Oak Ridge, TN, USA, 8–10 January 2013.
15. Jeffrey, N.; Tan, Q.; Villar, J.R. A review of anomaly detection strategies to detect threats to cyber-physical systems. Electronics
2023, 12, 3283. [CrossRef]
16. Kim, I.; Kim, D.; Kim, B.; Choi, Y.; Yoon, S.; Oh, J.; Jang, J. A case study of unknown attack detection against Zero-day worm in
the honeynet environment. In Proceedings of the 2009 11th International Conference on Advanced Communication Technology,
Gangwon, Republic of Korea, 15–18 February 2009; IEEE: Piscataway, NJ, USA, 2009.
17. Aparicio-Navarro, F.J.; Kyriakopoulos, K.G.; Gong, Y.; Parish, D.J.; Chambers, J.A. Using Pattern-of-Life as Contextual Information
for Anomaly-Based Intrusion Detection Systems; IEEE Access: Piscataway, NJ, USA, 2017; Volume 5, pp. 22177–22193.
18. Aishwarya, R.; Malliga, S. Intrusion detection system-An efficient way to thwart against Dos/DDos attack in the cloud environ-
ment. In Proceedings of the 2014 International Conference on Recent Trends in Information Technology, Chennai, India, 10–12
April 2014; IEEE: Piscataway, NJ, USA, 2014.
19. Al-Dabbagh, A.W.; Li, Y.; Chen, T. An intrusion detection system for cyber attacks in wireless networked control systems. IEEE
Trans. Circuits Syst. II Express Briefs 2017, 65, 1049–1053. [CrossRef]
20. Bhadre, P.; Gothawal, D. Detection and blocking of spammers using SPOT detection algorithm. In Proceedings of the 2014 First
International Conference on Networks & Soft Computing (ICNSC2014), Guntur, India, 19–20 August 2014; IEEE: Piscataway, NJ,
USA, 2014.
21. Bottazzi, G.; Casalicchio, E.; Cingolani, D.; Marturana, F.; Piu, M. MP-shield: A framework for phishing detection in mobile
devices. In Proceedings of the 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous
Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing,
Liverpool, UK, 26–28 October 2015; IEEE: Piscataway, NJ, USA, 2015.
22. Chen, C.-M.; Hsiao, H.-W.; Yang, P.-Y.; Ou, Y.-H. Defending malicious attacks in cyber physical systems. In Proceedings of the
2013 IEEE 1st International Conference on Cyber-Physical Systems, Networks, and Applications (CPSNA), Taipei, China, 19–20
August 2013; IEEE: Piscataway, NJ, USA, 2013.
23. Trajkovic, L.; Wong, S.; Triphati, S.K.; Lin, K.-J. In Proceedings of the International Computer Symposium (ICS 2010), Tainan,
China, 16–18 December 2010.
24. Chonka, A.; Abawajy, J. Detecting and mitigating HX-DoS attacks against cloud web services. In Proceedings of the 2012 15th
International Conference on Network-Based Information Systems, Melbourne, VIC, Australia, 26–28 September 2012; IEEE:
Piscataway, NJ, USA, 2012.
Sensors 2024, 24, 4888 23 of 24
25. Devi, B.K.; Preetha, G.; Selvaram, G.; Shalinie, S.M. An impact analysis: Real time DDoS attack detection and mitigation using
machine learning. In Proceedings of the 2014 International Conference on Recent Trends in Information Technology, Chennai,
India, 10–12 April 2014; IEEE: Piscataway, NJ, USA, 2014.
26. Eslahi, M.; Hashim, H.; Tahir, N. An efficient false alarm reduction approach in HTTP-based botnet detection. In Proceedings
of the 2013 IEEE Symposium on Computers & Informatics (ISCI), Langkawi, Malaysia, 7–9 April 2013; IEEE: Piscataway, NJ,
USA, 2013.
27. Abraham, S.; Chengalur-Smith, I. An overview of social engineering malware: Trends, tactics, and implications. Technol. Soc.
2010, 32, 183–196. [CrossRef]
28. Goenka, R.; Chawla, M.; Tiwari, N. A comprehensive survey of phishing: Mediums, intended targets, attack and defence
techniques and a novel taxonomy. Int. J. Inf. Secur. 2024, 23, 819–848. [CrossRef]
29. Nasereddin, M.; ALKhamaiseh, A.; Qasaimeh, M.; Al-Qassas, R. A systematic review of detection and prevention techniques of
SQL injection attacks. Inf. Secur.J. A Glob. Perspect. 2023, 32, 252–265. [CrossRef]
30. Alarfaj, F.K.; Khan, N.A. Enhancing the performance of SQL injection attack detection through probabilistic neural networks.
Appl. Sci. 2023, 13, 4365. [CrossRef]
31. Gupta, A.; Gupta, U.; Kumar, A.; Bhushan, B. Analysing Security Threats And Elevating Healthcare Privacy For A Resilient
Future. In Proceedings of the 2023 International Conference on Artificial Intelligence for Innovations in Healthcare Industries
(ICAIIHI), Raipur, India, 29–30 December 2023; IEEE: Piscataway, NJ, USA, 2014.
32. Okigui, H.H. An Analysis of Cyber-Security Policy Compliance in Organisations; Cape Peninsula University of Technology: Cape
Town, South Africa, 2023.
33. Bhadra, S.; Mohammed, S. Cloud Computing Threats and Risks: Uncertainty and Unconrollability in the Risk Socety. Electron. J.
2020, 7, 1047–1071.
34. Bhadra, S. Securing Cloudy Cyberspace: An Overview of Crimes, Threats and Risks. Int. Res. J. Eng. Technol. 2020, 7.
35. Pulyala, S.R. From Detection to Prediction: AI-powered SIEM for Proactive Threat Hunting and Risk Mitigation. Turk. J. Comput.
Math. Educ. (TURCOMAT) 2024, 15, 34–43. [CrossRef]
36. Tahmasebi, M. Beyond Defense: Proactive Approaches to Disaster Recovery and Threat Intelligence in Modern Enterprises. J. Inf.
Secur. 2024, 15, 106–133. [CrossRef]
37. George, A.S.; Baskar, T.; Srikaanth, P.B. Cyber Threats to Critical Infrastructure: Assessing Vulnerabilities Across Key Sectors.
Partn. Univers. Int. Innov. J. 2024, 2, 51–75.
38. Kumar, P.; Javeed, D.; Kumar, R.; Islam, A.N. Blockchain and explainable AI for enhanced decision making in cyber threat
detection. Softw. Pract. Exp. 2024, 54. [CrossRef]
39. Almahmoud, Z.; Yoo, P.D.; Alhussein, O.; Farhat, I.; Damiani, E. A holistic and proactive approach to forecasting cyber threats.
Sci. Rep. 2023, 13, 8049. [CrossRef]
40. Nuiaa, R.R.; Manickam, S.; Alsaeedi, A.H.; Alomari, E.S. A new proactive feature selection model based on the enhanced
optimization algorithms to detect DRDoS attacks. Int. J. Electr. Comput. Eng. 2022, 12, 869–1880. [CrossRef]
41. Gautam, A.S.; Gahlot, Y.; Kamat, P. Hacker forum exploit and classification for proactive cyber threat intelligence. In Inventive
Computation Technologies; Springer: Berlin/Heidelberg, Germany, 2020; Volume 4.
42. AlHidaifi, S.M.; Asghar, M.R.; Ansari, I.S. A Survey on Cyber Resilience: Key Strategies, Research Challenges, and Future
Directions. ACM Comput. Surv. 2024, 56, 1337–1360. [CrossRef]
43. Awotunde, J.B.; Folorunso, S.O.; Imoize, A.L.; Odunuga, J.O.; Lee, C.C.; Li, C.T.; Do, D.T. An ensemble tree-based model for
intrusion detection in industrial internet of things networks. Appl. Sci. 2023, 13, 2479. [CrossRef]
44. Ravi, R.; Shekhar, B. Sql Vulnerability Prevention in Cybercrime Using Dynamic Evaluation of Shell and Remote File Injection
Attacks. Int. J. Adv. Res. Biol. Ecol. Sci. Technol. 2015, 1.
45. Krishnan, S. A Hybrid Approach to Threat Modelling. Available online: [Link]
3/[Link] (accessed on 10 July 2018).
46. Mushore, K. Security Concerns in Implementing Service Oriented Architecture: A Game Theoretical Analysis; University of Johannesburg:
Johannesburg, South Africa, 2015.
47. Arora, A. Towards Safeguarding Users Against Phishing and Ransomware Attacks; The University of Alabama at Birmingham:
Birmingham, AL, USA, 2019.
48. Yau, S.S.; Buduru, A.B.; Nagaraja, V. Protecting critical cloud infrastructures with predictive capability. In Proceedings of the 2015
IEEE 8th International Conference on Cloud Computing, New York, NY, USA, 27 June–2 July 2015; IEEE: Piscataway, NJ, USA,
2015.
49. Jafri, S.M.R.; Khan, A.; Talpur, F.; Saeed, S.; Khan, M.A. Information Security in modern way of education system in Pakistan. Int.
J. Technol. Res. 2015, 3, 42.
50. Meryem, A.; Ouahidi, B.E. Hybrid intrusion detection system using machine learning. Netw. Secur. 2020, 2020, 8–19. [CrossRef]
51. Shaukat, S.U. Optimum Parameter Machine Learning Classification and Prediction of Internet of Things (IoT) Malwares Using Static
malware Analysis Techniques; University of Salford: Salford, UK, 2018.
52. Rathod, T.; Jadav, N.K.; Tanwar, S.; Polkowsk, Z.; Yamsa, N.; Sharm, R.; Alqahtan, F.; Gafa, A. AI and Blockchain-Based Secure
Data Dissemination Architecture for IoT-Enabled Critical Infrastructure. Sensors 2023, 23, 8928. [CrossRef]
Sensors 2024, 24, 4888 24 of 24
53. Fatani, A.; Dahou, A.; Abd Elaziz, M.; Al-Qaness, M.A.; Lu, S.; Alfadhli, S.A.; Alresheedi, S.S. Enhancing intrusion detection
systems for IoT and cloud environments using a growth optimizer algorithm and conventional neural networks. Sensors 2023, 23,
4430. [CrossRef]
54. Singh, J.; Singh, J. Detection of malicious software by analyzing the behavioral artifacts using machine learning algorithms. Inf.
Softw. Technol. 2020, 121, 106273. [CrossRef]
55. Kiourkoulis, S. DDoS datasets: Use of Machine Learning to Analyse Intrusion Detection Performance. Master’s Thesis, Luleå
University of Technology, Luleå, Sweden, 2020.
56. Alfawareh, M.D. Cyber Threat Intelligence Using Deep Learning to Detect Abnormal Network Behavior. Ph.D. Thesis, Princess
Sumaya University for Technology, Amman, Jordan, 2020.
57. Calvet, L.; de Armas, J.; Masip, D.; Juan, A.A. Learnheuristics: Hybridizing metaheuristics with machine learning for optimization
with dynamic inputs. Open Math. 2017, 15, 261–280. [CrossRef]
58. Zhang, J.; Huang, Y.; Wang, Y.; Ma, G. Multi-objective optimization of concrete mixture proportions using machine learning and
metaheuristic algorithms. Constr. Build. Mater. 2020, 253, 119208. [CrossRef]
59. Sabar, N.R.; Yi, X.; Song, A. A bi-objective hyper-heuristic support vector machines for big data cyber-security. IEEE Access 2018,
6, 10421–10431. [CrossRef]
60. Haghnegahdar, L.; Wang, Y. A whale optimization algorithm-trained artificial neural network for smart grid cyber intrusion
detection. Neural Comput. Appl. 2020, 32, 9427–9441. [CrossRef]
61. Ibor, A.E.; Oladeji, F.A.; Okunoye, O.B. A survey of cyber security approaches for attack detection prediction and prevention. Int.
J. Secur. Its Appl. 2018, 12, 15–28. [CrossRef]
62. Balyan, A.K.; Ahuja, S.; Lilhore, U.K.; Sharma, S.K.; Manoharan, P.; Algarni, A.D.; Elmannai, H.; Raahemifar, K. A hybrid
intrusion detection model using ega-pso and improved random forest method. Sensors 2022, 22, 5986. [CrossRef]
63. Hosseini, S.; Zade, B.M.H. New hybrid method for attack detection using combination of evolutionary algorithms, SVM, and
ANN. Comput. Netw. 2020, 173, 107168. [CrossRef]
64. Elnour, M.; Meskin, N.; Khan, K.M. Hybrid attack detection framework for industrial control systems using 1D-convolutional
neural network and isolation forest. In Proceedings of the 2020 IEEE Conference on Control Technology and Applications (CCTA),
Montreal, QC, Canada, 24–26 August 2020; IEEE: Piscataway, NJ, USA, 2020.
65. Alhaidari, F.A.; Al-Dahasi, E.M. New approach to determine DDoS attack patterns on SCADA system using machine learning. In
Proceedings of the 2019 International conference on computer and information sciences (ICCIS), Sakaka, Saudi Arabia, 3–4 April
2019; IEEE: Piscataway, NJ, USA, 2019.
66. Puthal, D.; Wilson, S.; Nanda, A.; Liu, M.; Swain, S.; Sahoo, B.P.S.; Yelamarthi, K.; Pillai, P.; El-Sayed, H.; Prasad, M. Decision tree
based user-centric security solution for critical IoT infrastructure. Comput. Electr. Eng. 2022, 99, 107754. [CrossRef]
67. Ragab, M.; Altalbe, A. A Blockchain-based architecture for enabling cybersecurity in the internet-of-critical infrastructures.
Comput. Mater. Contin. 2022, 72, 1579–1592. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.