VISVESVARAYA TECHNOLOGICAL UNIVERSITY
Jnana Sangama, Machhe, Belagavi, Karnataka 590018
Project Report
on
“CYBERSHIELD AI”
Submitted in partial fulfillment of the requirement
for the award of the degree of
Bachelor of Engineering
in
Artificial Intelligence & Machine Learning
by
Naina Nimisha (1BG20AI053)
Siddhant Priyadarshi (1BG20AI087)
Subrat Pandey (1BG20AI089)
Susanna John (1BG20AI092)
Under the Guidance Of
Dr. Tejaswini R Murgod
Associate Professor, Dept. of AIML
B.N.M. Institute of Technology
An Autonomous Institution under VTU, Approved by AICTE
Department of Artificial Intelligence & Machine Learning
2023 – 2024
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & MACHINE LEARNING
CERTIFICATE
Certified that the project work entitled CYBERSHIELD AI is carried out by Naina
Nimisha (1BG20AI053), Siddhant Priyadarshi (1BG20AI087), Subrat
Pandey (1BG20AI089), and Susanna John (1BG20AI092), the bona fide students of B.N.M.
Institute of Technology in partial fulfillment for the award of Bachelor of Engineering in
Artificial Intelligence & Machine Learning of the Visvesvaraya Technological
University, Belagavi during the year 2023-2024. It is certified that all corrections /
suggestions indicated for Internal Assessment have been incorporated in the report deposited
in the departmental library. The Project report has been approved as it satisfies the academic
requirements in respect of Project work prescribed for the said Degree.
Dr. Tejaswini R Murgod, Associate Professor, Dept. of AIML, BNMIT
Dr. Sheba Selvam, Prof. & Head, Dept. of AIML, BNMIT
Dr. Krishnamurthy G N, Principal, BNMIT
Name of the Examiners Signature with Date
1.
2.
ACKNOWLEDGEMENT
We consider it a privilege to express through the pages of this report, a few words of
gratitude to all those distinguished personalities who guided and inspired us in the
completion of this project work.
We would like to thank Shri. Narayan Rao R Maanay, Secretary, BNMIT, Bengaluru for
providing an excellent academic environment in college.
We would like to thank Prof. T.J. Rama Murthy, Director, BNMIT, Bengaluru for having
extended his support and encouragement during the course of work.
We would like to thank Dr. S.Y. Kulkarni, Additional Director, BNMIT, Bengaluru for his
extended support and encouragement during the course of work.
We would like to express our gratitude to Prof. Eishwar N Maanay, Dean, BNMIT,
Bengaluru for his relentless support, guidance, and encouragement.
We would like to thank Dr. Krishnamurthy G.N., Principal, BNMIT, Bengaluru for his
constant encouragement.
We would like to thank Dr. Sheba Selvam, Professor and Head of the Department of
Artificial Intelligence & Machine Learning, BNMIT, Bengaluru, for her support and
encouragement towards the completion of our project work.
We would like to express gratitude to our guide Dr. Tejaswini R Murgod, Associate
Professor, Department of Artificial Intelligence & Machine Learning, BNMIT, Bengaluru,
who has given us all the support and guidance in completing our project work successfully.
Naina Nimisha
1BG20AI053
Siddhant Priyadarshi
1BG20AI087
Subrat Pandey
1BG20AI089
Susanna John
1BG20AI092
ABSTRACT
The CyberShield AI project represents a groundbreaking endeavor in cybersecurity,
harnessing advanced machine learning and deep learning methodologies to tackle the
pervasive threats of phishing and spam. Through meticulous design and implementation, the
project introduces two pivotal modules aimed at fortifying defenses against these malicious
activities. The URL phishing detector module serves as a proactive barrier against deceptive
URLs, repositories, and IP addresses. Leveraging sophisticated feature extraction techniques
and a diverse range of classifiers, including Random Forest and SVM, the module
distinguishes between legitimate and phishing URLs with exceptional accuracy.
Furthermore, the integration of web scraping capabilities enhances the system's contextual
understanding, augmenting its ability to identify potential threats. Complementing this, the
spam email detector module employs Natural Language Processing (NLP) for
comprehensive email analysis. By extracting features such as common spam keywords and
attachment types, the module discerns between genuine emails and malicious spam. Multiple
classifiers, including SVM and Random Forest, trained on extensive datasets, ensure robust
performance in detecting and neutralizing spam threats. Central to the project's success is its
seamless integration with MongoDB, facilitating efficient storage and retrieval of critical
data. This structured approach enables iterative refinement of threat detection capabilities,
ensuring the system remains adaptive and responsive to evolving cyber threats. Moreover,
the project prioritizes user experience through the development of an intuitive frontend
interface powered by the Streamlit library. This interface empowers users to interact with the
system effortlessly, providing actionable insights based on risk assessment and fostering
informed decision-making.
TABLE OF CONTENTS
Chapter No. Title
Acknowledgment
Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Objectives
1.4 Use Case
1.5 Summary
2 Literature Survey
2.1 Overview
2.2 Literature Survey
2.3 Approach Towards the Problem
2.4 Summary
3 System Requirement Specification and Cost Estimation
3.1 Hardware Requirements
3.2 Software Requirements
3.3 Functional Requirements
3.4 Non-Functional Requirements
3.5 Summary
4 System Design and Development
4.1 Architectural Design
4.2 Dataflow Diagram
4.3 UML Diagrams
4.4 Algorithm
5 Implementation
5.1 List of Modules
5.2 Dataset
5.3 Module Description
6 Testing and Validation
6.1 Testing Methods
6.2 Test Cases
7 Results and Discussion
7.1 Model Performance
7.2 Summary
8 Conclusion and Scope for Future Enhancements
8.1 Conclusion
8.2 Future Work
Appendix I
References
List of Figures
Chapter No. Figure No. Description
4 Fig 4.1 Architectural Diagram
Fig 4.2 Dataflow diagram
Fig 4.3.1 Use Case Diagram
Fig 4.3.2 Sequence Diagram
5 Fig 5.2.1 Dataset for URL
Fig 5.2.2 Dataset for spam email
Fig 5.3.1 Code for Feature extractor
Fig 5.3.2 Working of Feature extractor
Fig 5.3.3 Code for model
Fig 5.3.4 Working of model
Fig 5.3.5 Code for scrapeit
Fig 5.3.6 Code for frontend
Fig 5.3.7 Working of frontend
Fig 5.3.8 Code for MongoDB utils
Fig 5.3.9 Code for email features
Fig 5.3.10 Code for tokenizer
Fig 5.3.11 Code for Email model
7 Fig 7.1.1 Training of model
Fig 7.1.2 Confusion Matrix
Fig 7.1.3 Training and validation loss
Fig 7.1.4 Training and validation accuracy
Fig 7.1.5 Saved URL in dataset
Fig 7.1.6 Frontend for URL
Fig 7.1.7 Spam Email detection
List of Tables
Chapter No. Table No. Description
7 Table 7.1.1 Accuracy of different models
CHAPTER 1
INTRODUCTION
The overarching aim of this project is to architect an intelligent system, underpinned
by advanced machine learning algorithms, tailored for the proactive detection and
mitigation of sophisticated phishing domains. In an era where deceptive cyber threats
continually evolve, our initiative seeks to bolster cybersecurity frameworks,
reinforcing digital infrastructures against the insidious onslaught of phishing attacks
that mimic authentic domains. Central to our endeavor is the enhancement of online
security paradigms, fostering a resilient digital milieu that safeguards both individual
internet users and enterprises.
Delving into the scope, our anti-phishing system encompasses an expansive array of
facets, spanning from email and mobile security to real-time monitoring and behavior
analysis. Beyond mere detection, our multifaceted approach pivots on user education,
data protection, and regulatory compliance, harmoniously integrating with the
broader cybersecurity ecosystem. Prioritizing user-centricity, we are committed to
cultivating a user-friendly interface for administrators, thereby synergizing
operational efficiency with brand reputation protection. Ultimately, this project
aspires to sculpt a fortified defense mechanism against deceptive phishing endeavors,
championing a safer and more secure digital landscape for all stakeholders.
1.1 Motivation
The escalating sophistication of cyber threats, particularly in the domain of phishing,
underscores the urgent need for innovative defensive strategies. While current anti-
phishing systems have made notable strides, they remain encumbered by intrinsic
limitations, compromising their efficacy in combating evolving threats. This project
is driven by a compelling motivation to transcend these constraints and usher in a
paradigm shift in anti-phishing defense mechanisms. Central to our initiative is the
optimization of feature vector selection—a cornerstone that critically shapes the
accuracy of detection algorithms. The project endeavors to pioneer advanced feature
BE/Dept. of AIML/BNMIT Page 2 2023-2024
Cyber Shield AI
engineering strategies, enabling more refined and precise detection capabilities. A
pivotal focus area lies in mitigating the challenges posed by zero-day attacks,
sophisticated exploits that elude conventional detection methodologies.
Addressing the prevalent issue of false positives, our project aims to recalibrate
detection parameters through meticulous algorithmic refinement and empirical
analyses. By striking a harmonious balance between detection sensitivity and
specificity, we seek to instill greater user confidence while alleviating operational
burdens on cybersecurity frameworks. Furthermore, our endeavor emphasizes the
optimization of computational resources, advocating for efficient algorithmic designs
and parallel computing paradigms. By enhancing operational efficiency without
compromising detection accuracy, we aspire to forge a robust and responsive defense
infrastructure capable of safeguarding users on a global scale.
1.2 Problem Statement
In today's digital landscape, the escalating menace of phishing attacks underscores
the imperative for a robust anti-phishing detection system. Crafting such a system is
intricate, necessitating a holistic analysis of data drawn from a plethora of sources.
The integration of advanced machine learning algorithms, including SVM, Random
Forest, and Naive Bayes, aims to classify and promptly flag potential
phishing endeavors. Paramount to this endeavor is instilling confidence in these
classifications, underpinned by real-time monitoring capabilities. Furthermore, the
incorporation of a user-friendly interface is pivotal, ensuring heightened threat
awareness and empowering users to actively participate in the defense against
phishing. Ultimately, the overarching objective is to architect a resilient solution that
not only detects but decisively thwarts phishing activities, fortifying user safety
in the ever-evolving digital realm.
1.3 Objectives
Leverage machine learning and deep learning methodologies to create a
robust system capable of accurately classifying phishing instances in real-time.
This system should continuously evolve to confront emerging threats
and adapt to new patterns.
Utilize Natural Language Processing (NLP) and behavioural analysis
techniques for a thorough examination of both static and dynamic web
content. This comprehensive approach will aid in anomaly detection,
identifying deviations from expected behaviour that might indicate potential
phishing attempts.
Gather diverse data sources and integrate real-time threat intelligence to
continually update and refine the detection models. This ensures that the
system remains adaptive and responsive to the evolving landscape of phishing
attacks.
Create an intuitive interface for alerts, reporting, and user engagement. This
interface should facilitate easy reporting of suspected phishing attempts
by users, contributing to model refinement and ensuring an inclusive
approach to system improvement.
Ensure seamless integration across various platforms, including browsers,
applications, and email clients. The system should be scalable to
accommodate increasing data volumes and user interactions without
compromising performance.
Generate comprehensive reports that aid in compliance adherence and
incident response. These reports should include tracked threat detections,
system performance metrics, and actionable insights for cybersecurity
professionals.
1.4 Use Case
Cybersecurity Firms: Enhance client protection by integrating the
anti-phishing solutions into their products.
Financial Institutions: Prevent fraud and secure accounts and
transactions against phishing attacks.
E-commerce Platforms: Safeguard users from scams and fraudulent listings.
Email Services: Automatically filter out phishing emails to shield users.
Government Agencies and Educational Institutions: Enhance security
and protect sensitive information.
Social Media Networks: Block phishing attempts to safeguard user accounts.
Web Hosting Providers and Telecommunication Companies: Offer phishing
detection services to enhance security for their users.
Online Service Providers Across Industries: Maintain platform integrity and
trustworthiness with robust anti-phishing solutions.
These use cases highlight the comprehensive scope and diverse applications
of the proposed system, illustrating its potential impact across various sectors
in combatting phishing threats.
1.5 Summary
The primary objective of this project is to develop and deploy a sophisticated machine
learning-driven system dedicated to detecting and neutralizing phishing domains.
These domains, which adeptly mimic genuine platforms, pose significant threats in
the digital landscape. By harnessing advanced machine learning algorithms, the
project aims to augment existing cybersecurity measures, enabling proactive defenses
against evolving phishing tactics. The scope of the anti-phishing system is
comprehensive, encompassing various dimensions of cybersecurity, including
phishing detection, email and mobile security, web security, user education, real-time
monitoring, behavior analysis, data protection, and fraudulent content blocking.
Through continuous research and development, regulatory compliance, and tailored
solutions, the project aims to safeguard organizational assets and stakeholders while
integrating seamlessly with existing security ecosystems.
Central to the project's ethos is a user-centric approach, prioritizing intuitive interface
design and optimal protection against deceptive phishing attacks. By implementing
rigorous security protocols and monitoring mechanisms, the system aims to mitigate
risks associated with phishing attacks, safeguarding organizational reputations and
stakeholder trust. Ultimately, the project aspires to cultivate a safer online
environment by amalgamating advanced technologies, comprehensive security
protocols, and user-centric design principles. Through these endeavors, it seeks to
champion the cause of online security, fortifying the digital ecosystem against the
pervasive threat of phishing activities. The central mission of this project revolves
around the development and deployment of a sophisticated, machine learning-driven
system dedicated to the detection and neutralization of phishing domains. These
domains adeptly mimic the characteristics, identity, and user experiences of genuine
platforms, making them particularly insidious threats in the digital landscape.
CHAPTER 2
LITERATURE SURVEY
2.1 Overview
The internet is an indispensable facet of modern daily life, serving multifaceted
purposes and intricately linking computers through diverse telecommunication
channels. Its pervasive nature provides unparalleled connectivity but also exposes
users to an increasingly prevalent threat—phishing. This form of online fraud
cleverly amalgamates social manipulation with technical deceit, cunningly tricking
internet users into divulging their confidential data or crucial online information.
Phishing tactics encompass a broad spectrum of social engineering strategies,
ranging from the deceptive facade of spoofed emails and counterfeit websites to the
insidious guise of dubious online advertisements, fake SMS messages, and the highly
targeted approach known as spear phishing.
These nefarious tactics don't discriminate; they target entities spanning from major
corporations and financial institutions to payment companies, military
establishments, and governmental agencies. Unfortunately, the aftermath of
successful phishing attacks often results in substantial financial losses and severe
damage to the affected entity's brand credibility.
Recognizing the gravity of this threat, the development and deployment of anti-
phishing systems have become imperative. These systems act as critical safeguards,
not only protecting the sensitive information of individuals and organizations but also
serving as a crucial mitigation strategy against financial losses. Moreover, legal
mandates in numerous regions necessitate that organizations protect user data,
heightening the significance of effective anti-phishing measures for regulatory
compliance. Given the relentless evolution of cyber threats, these systems must
continually adapt and innovate to counter new attack methods, including the
ever-looming danger posed by zero-day threats. Thus, the pursuit of robust anti-phishing
measures remains a pressing concern in ensuring the integrity and security of online
interactions for both individuals and institutions.
2.2 Literature Survey
Detecting Malicious URLs Using Machine Learning Techniques: Review
and Research Directions [1] - This project revolves around the identification
of malicious URLs employing a spectrum of machine learning models,
including Linear Regression, Support Vector Machine, Naïve Bayes, Random
Forests, DNN, and Hashing Vectorizer. The research, led by Malak Aljabri
and Hanan S. Altamimi along with their team, focuses on spam, malware, and
phishing detection. Achieving an impressive 98.82% accuracy in their
detection methodologies, the project highlights the challenge posed by the
scarcity of available datasets in this domain, emphasizing the necessity for
expanded resources to further enhance research and development in
malicious URL identification.
Phishing Website Analysis and Detection Using Machine Learning [2] -
This project delves into identifying common attributes among phishing
websites, constructing a model using classifiers like Random Forest and
Decision Tree. The study achieved an accuracy of 97.73%. However,
challenges remain in interpreting models, handling imbalanced datasets, and
ensuring real-world scalability and robustness of the proposed application for
detecting phishing websites.
Comparative Analysis of Intrusion Detection Systems and Machine
Learning-Based Model Analysis Through Decision Tree [3] - Focusing
on various Intrusion Detection Systems (IDS) - Network-Based (NIDS),
Signature-Based (SIDS), and Anomaly-Based (AIDS) - this project employs
a range of models like RNN, DNN, SVM, and Decision Trees. Achieving
93% accuracy, the project faces challenges in handling issues like flooding
and obfuscation.
Detecting Phishing Domains Using Machine Learning [4] - Employing
techniques like cross-validation, boosting, and stacking, this project utilizes
models such as Decision Trees, SVM, Random Forests, and ANN to detect
phishing domains with an accuracy of 97.4%. However, it heavily relies on
user observation, posing a limitation.
A URL-Based Social Semantic Attacks Detection With Character-Aware
Language Model [5] - Employing Bidirectional Encoder Representations from
Transformers (BERT) alongside LSTM, CNN, and DNN models, this project achieved
95.47% accuracy in detecting social semantic attacks. Yet, scalability
remains a concern, posing a challenge to its widespread application.
A Predictive Model for Phishing Detection [6] - Using a hybrid LSD with
Canopy feature selection, this project explores models like GBT, Naïve Bayes,
Linear Regression, Decision Tree, SVM, and Random Forest, achieving an
accuracy of 98.12%. However, underfitting poses a challenge to its
performance.
Intelligent Fraud Detection in Financial Statements Using Machine
Learning and Data Mining: A Systematic Literature Review [7] - This
project emphasizes the significance of appropriate feature selection in financial
fraud detection. While achieving 95.23% accuracy, it lacks exploration of
unsupervised learning methods like clustering and tends to focus more on
structured data rather than unstructured textual or audio data.
An intelligent cyber security phishing detection system using deep
learning techniques [8] - Introducing an intelligent phishing detection system
via deep learning techniques, this project employs various models like locally-
deep SVM, boosted decision tree, logistic regression, neural networks, and
decision forests. While achieving high accuracies, it acknowledges the need for
improved feature selection techniques to counter evolving phishing techniques
over time.
Robust Ensemble Machine Learning Model for Filtering Phishing URLs:
Expandable Random Gradient Stacked Voting Classifier (ERG-SVC) [9]
- This project implements the Expandable Random Gradient Stacked Voting
Classifier (ERG-SVC) for filtering phishing URLs. Extracting features like
page-rank, IP, redirecting, domain age, and URL length, the model achieves an
accuracy of 98.118%. However, it relies on third-party attributes for URL
feature extraction, presents a complex architecture, and might incur high cloud
deployment costs.
Analysis of Data Engineering for Fraud Detection Using Machine
Learning and Artificial Intelligence Technologies [10] - Focusing on fraud
detection, this project emphasizes feature and instance engineering. Employing
methods like Gradient Boosted Trees, SMOTE, ADASYN, MWMOTE, and
ROSE, the project attains a 96.2% accuracy rate. Challenges include potential
scalability issues and real-world implementation hurdles.
Phishing Detection using Machine Learning Algorithm [11] - Utilizing
techniques such as PCA, recursive feature selection, DNN, Machine Learning,
and KNN, this project achieves a 95.89% accuracy rate in phishing detection.
However, it faces a limitation due to the scarcity of validation data for the
model.
A Deep Learning-Based Framework for Phishing Website Detection [12]
- Employing LSTM, CNN, Random Forests, and Logistic Regression, this
project achieves a remarkable 99.18% accuracy rate. However, it grapples with
overfitting issues, impacting its generalizability.
Phishing URL Classification Analysis Using ANN Algorithm [13] -
Focusing on URL classification, this project employs Artificial Neural
Networks and achieves an accuracy of 98.72%. Challenges include decreased
accuracy with epoch numbers and the absence of NLP or GUI integration.
Phishing Detection Using Machine Learning Techniques [14] - Using IP
address, URL length, and various boosting algorithms, this project achieves a
92.65% accuracy rate. However, it notes the poor performance of the
AdaBoost algorithm on noisy data due to its slow learning rate.
An intelligent cyber security phishing detection system using deep
learning techniques [15] - The project proposes a machine learning-based
phishing detection model using various algorithms. Achieving accuracy scores
of 0.88, 1.00, and 0.97 (consecutively) for boosted decision tree models, it
emphasizes the need for improved feature selection techniques to counter
evolving phishing techniques over time.
Machine learning based phishing detection from URLs [16] - This project
focuses on phishing detection from URLs using seven different classification
algorithms and NLP-based features. Achieving 97.98% accuracy with Random
Forest using NLP features, it notes the absence of a universally acceptable test
set and suggests the need for certain parallel processing techniques and
subsystem enhancements.
Phishing Detection Using Machine Learning Techniques [17] - Utilizing
various features and machine learning techniques, this project achieves a
92.65% accuracy rate. However, it highlights the performance issue of the
AdaBoost algorithm on noisy data due to its slow learning rate.
Detection of Phishing Websites using an Efficient Machine Learning Framework
[18] - Employing various attributes for feature extraction, this project achieves
a phishing detection accuracy of 91.4% with the RF classifier. However, the
paper does not identify or discuss any gaps in their approach.
Phishing Attack Detection Using Deep Learning Approach [19] - Utilizing deep
learning techniques like ANN, CNN, this project achieves a 95% accuracy rate.
However, it notes a lower detection rate for suspicious websites as a limitation.
Phishing Attacks Detection using Machine Learning Approach [20] -
Employing different machine learning algorithms, this project achieves an
accuracy of 91.94% with Decision Trees and 96.96% with Random Forests.
However, the study faces issues related to overfitting.
Countering Malicious URLs in Internet of Things Using a Knowledge-
Based Approach and a Simulated Expert [21] - This project employs various
methods and algorithms achieving a high accuracy of 99.8%. However, it
notes the lack of clarity in the evaluation measure's accuracy based on
empirical values.
Phishing Detection Based on Machine Learning and Feature Selection
Methods [22] - This project emphasizes feature selection and the Multilayer
Perceptron for phishing detection. Achieving a 97.4% accuracy rate, the study
notes the metrics used were outdated, potentially indicating a need for more
current evaluation methods or performance metrics.
2.3 Approach Towards the Problem
The proposed anti-phishing system represents a comprehensive and adaptive
approach to combating the multifaceted challenges posed by phishing activities in the
digital landscape. Rooted in advanced machine learning techniques and enriched by
a diverse array of data sources, the system embodies a holistic paradigm that
transcends traditional boundaries of detection and response.
Data Compilation and Feature Extraction: At the nucleus of our system lies a
robust data compilation mechanism, adeptly synthesizing information from an
eclectic mix of sources, including web URLs, content repositories, metadata, and
user submissions. This expansive data reservoir undergoes meticulous scrutiny,
wherein both visual and textual dimensions are rigorously analyzed. Key features,
ranging from HTML content and image properties to textual elements and structural
attributes, are meticulously extracted, laying the foundation for subsequent analysis
and classification.
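The feature-extraction step described above can be illustrated with a minimal Python sketch restricted to lexical URL features. The feature names, the `SUSPICIOUS_TOKENS` list, and the helper functions are assumptions made for illustration only; the report's own feature extractor is presented later and may differ.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical tokens often associated with phishing URLs (an assumption, not from the report).
SUSPICIOUS_TOKENS = ("login", "verify", "secure", "update", "account")


def host_is_ip(host: str) -> bool:
    """Return True when the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_url_features(url: str) -> dict:
    """Turn a URL into a flat dictionary of numeric lexical features."""
    # Add a scheme when it is missing so that urlparse can locate the host.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_hyphens": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parsed.scheme == "https"),
        "host_is_ip": int(host_is_ip(host)),
        "suspicious_tokens": sum(tok in lowered for tok in SUSPICIOUS_TOKENS),
    }
```

Each dictionary can then be converted into a row of a feature matrix and passed to a classifier; content-based features such as HTML or image properties would need a separate scraping step that this sketch does not cover.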
Machine Learning Integration: Leveraging a sophisticated ensemble of machine
learning algorithms, including Support Vector Machines (SVM), Random
Forest, and Naive Bayes, the system embarks on a transformative journey of model
training and refinement. By assimilating labeled datasets that encapsulate the intricate
distinctions between legitimate and phishing data, the algorithms are primed to
discern subtle patterns and anomalies indicative of deceptive activities. Upon
encountering incoming data streams, the models meticulously evaluate the extracted
features, promptly flagging instances that manifest known phishing attributes.
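To make the training-and-classification idea concrete, the following Python sketch implements one of the three named algorithms, Naive Bayes, in its multinomial form with add-one smoothing over text tokens. It is a teaching sketch under stated assumptions: the `TinyNaiveBayes` class, the `tokenize` helper, and the four training sentences are invented for illustration, and a real system would more plausibly rely on a library such as Scikit-learn.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        # Count documents per class and token occurrences per class.
        self.class_counts = Counter(labels)
        self.token_counts = {label: Counter() for label in self.class_counts}
        for tokens, label in zip(documents, labels):
            self.token_counts[label].update(tokens)
        self.vocabulary = {
            tok for counts in self.token_counts.values() for tok in counts
        }
        return self

    def log_scores(self, tokens):
        """Return the unnormalised log posterior of each class."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.token_counts[label]
            total_tokens = sum(counts.values())
            score = math.log(doc_count / total_docs)  # class prior
            for tok in tokens:
                if tok not in self.vocabulary:
                    continue  # tokens never seen in training are ignored
                score += math.log((counts[tok] + 1) / (total_tokens + vocab_size))
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()


# Toy labelled data, invented purely for illustration.
train_texts = [
    "verify your account now",
    "urgent update your password",
    "meeting agenda for monday",
    "lunch at noon tomorrow",
]
train_labels = ["phishing", "phishing", "legitimate", "legitimate"]

model = TinyNaiveBayes().fit([tokenize(t) for t in train_texts], train_labels)
```

Calling `model.predict(tokenize(...))` on a new message returns the label with the highest log score. The same fit and predict pattern applies to SVM and Random Forest, which operate on numeric feature vectors rather than token counts.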
Scoring Mechanism and Real-time Monitoring: Augmenting the classification
process is a dynamic scoring system, characterized by predefined thresholds that serve
as bulwarks against false positives. This nuanced scoring paradigm not only enhances
the precision of classifications but also instills confidence in the detection outcomes.
Operating in real-time, the system perpetually scans data streams, vigilantly
identifying phishing traits and precipitating immediate alerts or access restrictions.
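One way to picture the threshold-based scoring described above is the short Python sketch below. It averages the phishing probabilities produced by several classifiers and maps the result to an action. The threshold values, the averaging rule, and the function names are assumptions; the report does not state the thresholds or the aggregation it uses.

```python
from statistics import mean

# Hypothetical cut-offs; the report does not give the values it uses.
ALERT_THRESHOLD = 0.5
BLOCK_THRESHOLD = 0.8


def risk_score(probabilities):
    """Average the phishing probabilities reported by several classifiers."""
    if not probabilities:
        raise ValueError("at least one classifier probability is required")
    return mean(probabilities)


def decide(score, alert=ALERT_THRESHOLD, block=BLOCK_THRESHOLD):
    """Map a score in [0, 1] to an action using two cut-offs."""
    if score >= block:
        return "block"
    if score >= alert:
        return "alert"
    return "allow"
```

Raising the alert threshold reduces false positives at the cost of missing borderline cases, so in practice the cut-offs would be tuned on validation data rather than fixed by hand.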
User Interface and Reporting Mechanisms: Complementing the analytical prowess
of the system is an intuitive user interface, designed to foster enhanced threat
awareness and user engagement. Through this interface, users are empowered to
report suspicious domains, false positives, or anomalous activities, thereby
enriching the model's reservoir of insights and facilitating iterative refinement.
Advanced Features and Integration: The system's analytical ambit transcends static
web content, encompassing dynamically generated content and intricate textual
nuances. Integration of advanced machine learning architectures—such as
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)—
amplifies the accuracy and granularity of domain classifications. A salient addition to
the system's repertoire is behavioral analysis, enabling the detection of anomalous
user interactions with web pages—a pivotal dimension in contemporary
cybersecurity paradigms.
Data Diversity and Scalability: Emphasizing the importance of comprehensive
training and testing, the system amalgamates data from a myriad of sources, including
real-world datasets, DNS logs, and threat intelligence feeds. This expansive data
assimilation strategy ensures robust model adaptability and resilience against
emerging threats. Furthermore, the system's platform-agnostic architecture—
supporting web browsers, mobile applications, and email clients—ensures
ubiquitous protection across diverse digital interfaces. Scalability remains a
cornerstone of the system's design ethos, adeptly accommodating varying
workloads while maintaining optimal performance metrics.
Reporting and Compliance: Rounding off the system's multifaceted capabilities are
detailed reporting mechanisms, providing stakeholders with comprehensive insights
into detected threats, system performance, and compliance adherence. These reports
not only facilitate informed decision-making but also streamline incident response
protocols, fostering a proactive cybersecurity posture.
In essence, the proposed anti-phishing system epitomizes a confluence of innovation,
scalability, and adaptability, poised to redefine the contours of online security paradigms.
Through its holistic methodology, the system aspires to cultivate a safer, more secure
digital ecosystem, safeguarding users and organizations from the pervasive threats of
phishing activities.
2.4 Summary
The landscape of cybersecurity has seen numerous anti-phishing systems emerge, each
employing distinct methodologies to combat evolving phishing attacks. Existing
literature highlights common limitations such as reliance on static analysis techniques,
heuristic-driven approaches, and limited integration with advanced Natural Language
Processing (NLP) and behavioral analysis. These shortcomings underscore the need for
innovative solutions capable of transcending these constraints.
In response to these challenges, the proposed anti-phishing system offers a holistic and
adaptive paradigm designed to address the limitations of existing methodologies
comprehensively. Leveraging advanced machine learning algorithms, including
Support Vector Machines (SVM), Random Forest, and Naïve Bayes, enhanced by
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), the
system synthesizes information from diverse sources for multidimensional analysis.
Real-time monitoring, dynamic scoring, and advanced NLP techniques empower the
system to decipher linguistic cues, detect anomalies, and mitigate false positives and
negatives, thereby fostering a safer digital ecosystem.
CHAPTER 3
SYSTEM REQUIREMENT SPECIFICATION AND COST
ESTIMATION
3.1 Hardware Requirements
The CyberShield AI system operates efficiently on standard hardware configurations:
• Processor: Intel Core i5 or equivalent AMD processor (or higher) for optimal
performance during model training and inference.
• RAM: Minimum 8 GB of RAM; 16 GB or higher is recommended for efficient
processing of large datasets and complex machine learning models.
• Storage: At least 500 GB of storage space for storing datasets, trained models,
and system logs.
• Network Interface: Stable internet connection for web scraping, accessing
external threat intelligence feeds, and system updates.
• Graphics Processing Unit (GPU): Optional but recommended for accelerating
deep learning model training. NVIDIA GeForce GTX 1060 or higher is
preferable for GPU acceleration.
3.2 Software Requirements
CyberShield AI relies on software tools and libraries to facilitate its diverse functionalities:
• Operating System: Compatible with Windows, macOS, or Linux distributions.
• Python: Latest version of Python 3.x installed, along with package
management tools like pip or conda.
• Python Libraries: Installation of essential Python libraries such as Scikit-learn,
TensorFlow, Keras, Pandas, NumPy, Scrapy, and Streamlit for machine
learning, web scraping, and frontend development.
• MongoDB: Installation and configuration of MongoDB database for efficient
storage and retrieval of data generated by the system.
• Integrated Development Environment (IDE): Optional but recommended;
suitable IDEs include PyCharm and VS Code for streamlined development and
debugging.
• Web Browser: Compatible web browser for accessing the frontend
interface developed using Streamlit.
• External Libraries: Depending on specific requirements, additional
libraries for NLP (Natural Language Processing) tasks, web scraping, and
data visualization.
3.3 Functional Requirements
• URL Phishing Detection: Analyze URLs, repositories, and IP addresses for
phishing indicators, extracting features such as domain name, length, and
redirection count. Classify URLs with machine learning models (e.g., Random
Forest, SVM) and use web scraping to gather contextual information from
websites.
• Email Spam Detection: Analyze email content, attachments, and headers for
spam indicators, extracting features for sentiment analysis, keyword
recognition, and attachment scrutiny. Classify emails with machine learning
models (e.g., Naïve Bayes, SVM) and detect suspicious sender addresses and
invalid domains.
• MongoDB Integration: Establish a connection to the MongoDB database, insert
data, and organize it into structured collections for efficient storage and
retrieval. Data manipulation and retrieval must be compatible with MongoDB's
query language.
• Frontend Interface: Provide an interactive user interface built with the
Streamlit library that accepts URLs and email content as input and displays
classification results. Let the user select between the URL phishing detection
and email spam detection tasks, and present actionable insights and a risk
assessment based on the classification results.
3.4 Non-Functional Requirements
• Performance: Efficient utilization of system resources to ensure fast
processing and response times. Scalability to handle increasing data volumes
and user demand without performance degradation.
• Security: Implementation of secure data transmission protocols for handling
sensitive information. Protection against unauthorized access to the system
and its data repositories.
• Reliability: High availability of the system to ensure continuous operation and
accessibility. Fault tolerance mechanisms to mitigate the impact of system
failures or errors.
• Usability: Intuitive user interface design for ease of use and navigation.
Support for multiple languages and accessibility features to accommodate
diverse user needs.
• Maintainability: Modular and well-structured codebase for ease of
maintenance and future enhancements. Documentation of code, system
architecture, and deployment procedures for knowledge transfer and
troubleshooting.
• Compatibility: Compatibility with various operating systems, web browsers,
and hardware configurations. Support for integration with external APIs,
libraries, and services for extended functionality.
3.5 Summary
The CyberShield AI project necessitates hardware and software components capable
of supporting its sophisticated cybersecurity functionalities. On the hardware side, a
robust system with at least an Intel Core i5 processor, 8GB of RAM (preferably 16GB
or more), and ample storage space is essential. Additionally, a stable internet
connection and optional GPU for accelerated deep learning tasks are beneficial.
Software requirements include Python 3.x with essential libraries like Scikit-learn and
TensorFlow, MongoDB for data storage, and Streamlit for frontend development. The
project's functional requirements encompass URL phishing and email spam detection
capabilities, including feature extraction, machine learning model implementation,
web scraping, and MongoDB integration. Non-functional requirements focus on
performance, security, reliability, usability, and maintainability. The system must
efficiently utilize resources, ensure secure data handling, maintain high availability,
provide an intuitive user interface, and feature well-structured, documented code for
ease of maintenance. Meeting these requirements ensures CyberShield AI's
effectiveness in combating cyber threats while providing a seamless user experience.
CHAPTER 4
SYSTEM DESIGN AND DEVELOPMENT
4.1 Architectural Design
Fig:4.1 Architectural Diagram
Figure 4.1 illustrates the architectural diagram of our system, which is elaborated upon
in the following explanation.
1. Data Collection:
Data is collected from two primary sources: website traffic logs and a web crawler
designed for monitoring websites. Website traffic logs provide valuable information
about user interactions with websites, including URLs accessed and timestamps. The
web crawler systematically traverses websites, extracting data such as URLs,
webpage content, and metadata.
This collected data is then processed and formatted into a CSV (Comma-Separated
Values) format suitable for training machine learning models.
2. Model:
The collected data undergoes preprocessing to clean and prepare it for analysis. This may
involve tasks such as data normalization, handling missing values, and encoding
categorical variables.
Two models are employed for analysis: a Convolutional Neural Network (CNN) and
a Random Forest Classifier. These models are chosen for their ability to handle
different types of data and to complement each other's strengths.
The CNN is particularly adept at processing sequential data, such as the textual
content of URLs, while the Random Forest Classifier excels at handling tabular data,
such as features extracted from website traffic logs.
The models are trained using the preprocessed data to learn patterns indicative of
legitimate and phishing URLs. This training process involves iteratively adjusting
model parameters to minimize prediction errors and maximize accuracy.
3. MongoDB:
- The predictions made by the trained models are stored in MongoDB, a NoSQL
database known for its scalability and flexibility. MongoDB stores data in a document-
oriented format, allowing for efficient storage and retrieval of complex data
structures.
- Predictions are stored in a dictionary format within MongoDB, associating each
URL with its predicted classification (legitimate or phished). This allows for easy
querying and retrieval of predictions during runtime.
4. Output:
The final output of the process is a determination of whether each URL is flagged as
legitimate or phished.
This determination is presented in a dictionary format, indicating the classification of
each URL along with any additional relevant information.
The output can be further analyzed or integrated into a larger cybersecurity system
designed to protect users from accessing malicious websites.
Figure 4.3.1 presents a diagram outlining a cybersecurity AI system featuring two
primary modules: 'Detect Phishing URL' and 'Detect Spam Email.' The flow of
information, including user submission of URLs and emails, analysis based on features
and patterns, interaction with the database, and results displayed at the user interface,
is visually represented in a clear and professional layout.
Fig:4.3.2 Sequence Diagram for the system
Figure 4.3.2 depicts the sequence diagram of the system, illustrating the interactions
and flow of information between different components or modules. This diagram
provides a visual representation of how the system processes user inputs, performs
analysis or computations, interacts with databases or external services, and delivers
results back to the user interface. Overall, the sequence diagram offers a structured
overview of the system's functionality and the sequence of actions involved in
executing specific tasks or processes.
4.4 Algorithm
1. Data Collection:
Data is collected from two primary sources: website traffic logs and a web crawler
designed for monitoring websites. Website traffic logs provide valuable information
about user interactions with websites, including URLs accessed and timestamps. The
web crawler systematically traverses websites, extracting data such as URLs,
webpage content, and metadata. This collected data is then processed and formatted
into a CSV (Comma-Separated Values) format suitable for training machine learning
models.
Steps:
Gather data from website traffic logs and a web crawler.
Process and format data into a CSV format.
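The formatting step can be illustrated with a minimal sketch that uses only Python's standard csv module. The record fields (url, timestamp, source) and the sample rows are assumptions made for illustration; the report does not list the exact columns produced by the traffic logs or the crawler.

```python
import csv
import io

# Hypothetical records standing in for rows gathered from traffic logs and the crawler.
records = [
    {"url": "https://www.google.com/", "timestamp": "2024-05-07T10:15:00", "source": "traffic_log"},
    {"url": "http://example.test/login", "timestamp": "2024-05-07T10:16:30", "source": "crawler"},
]

def records_to_csv(rows, fieldnames):
    """Serialize a list of dict records into CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = records_to_csv(records, ["url", "timestamp", "source"])
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch self-contained; a real pipeline would write to a file opened with newline="".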
2. Model Training:
The collected data undergoes preprocessing to clean and prepare it for analysis. This
may involve tasks such as data normalization, handling missing values, and encoding
categorical variables. Two models are employed for analysis: a Convolutional Neural
Network (CNN) and a Random Forest Classifier. These models are chosen for their
ability to handle different types of data and to complement each other's strengths. The
CNN is particularly adept at processing sequential data, such as the textual content of
URLs, while the Random Forest Classifier excels at handling tabular data, such as
features extracted from website traffic logs. The models are trained using the
preprocessed data to learn patterns indicative of legitimate and phishing URLs. This
training process involves iteratively adjusting model parameters to minimize
prediction errors and maximize accuracy with the following steps:
Preprocess collected data including normalization and handling missing values.
Employ CNN and Random Forest models for analysis.
Train models using preprocessed data.
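The preprocessing tasks named above (handling missing values, normalization, and encoding categorical variables) can be sketched in plain Python as follows. The column names and values are hypothetical, and median imputation and min-max scaling are assumed choices; the report does not state which strategies were actually used.

```python
from statistics import median

def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fallback = median(observed)
    return [fallback if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def one_hot(values):
    """Encode a categorical column as one 0/1 vector per row."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical columns: a numeric feature with a gap and a categorical feature.
url_lengths = [23, None, 54, 120]
protocols = ["https", "http", "https", "https"]

cleaned = fill_missing(url_lengths)
scaled = min_max_normalize(cleaned)
encoded = one_hot(protocols)
```

In practice, libraries such as Pandas and Scikit-learn, which the report lists among its dependencies, provide equivalent operations.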
3. Data Storage (MongoDB):
The predictions made by the trained models are stored in MongoDB, a NoSQL
database known for its scalability and flexibility. MongoDB stores data in a document-
oriented format, allowing for efficient storage and retrieval of complex data structures.
Predictions are stored in a dictionary format within MongoDB, associating each URL
with its predicted classification (legitimate or phished). This allows for easy querying
and retrieval of predictions during runtime.
Steps:
• Store predictions made by trained models in MongoDB.
• Utilize MongoDB's document-oriented format for efficient storage and retrieval.
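A minimal sketch of the storage step is shown below. So that it runs without a database server, a small in-memory class stands in for the MongoDB collection; the comments name the pymongo calls it imitates. The field names model and checked_at, and the model name cnn_rf, are assumptions and do not come from the report.

```python
from datetime import datetime, timezone

def build_prediction_document(url, label, model_name):
    """Build the dictionary that would be stored as one MongoDB document."""
    return {
        "url": url,
        "prediction": label,                                   # "legitimate" or "phished"
        "model": model_name,                                   # hypothetical field
        "checked_at": datetime.now(timezone.utc).isoformat(),  # hypothetical field
    }

class InMemoryCollection:
    """Stand-in for a MongoDB collection so the sketch runs without a server."""

    def __init__(self):
        self._docs = {}

    def upsert(self, doc):
        # With pymongo this would be:
        # collection.update_one({"url": doc["url"]}, {"$set": doc}, upsert=True)
        self._docs[doc["url"]] = doc

    def find_by_url(self, url):
        # With pymongo this would be: collection.find_one({"url": url})
        return self._docs.get(url)

predictions = InMemoryCollection()
predictions.upsert(build_prediction_document("https://www.google.com/", "legitimate", "cnn_rf"))
stored = predictions.find_by_url("https://www.google.com/")
```

Keying each document by its URL means a repeated check overwrites the earlier prediction instead of creating a duplicate.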
4. Output Generation:
Description: The final output of the process is a determination of whether each URL is flagged
as legitimate or phished. This determination is presented in a dictionary format, indicating the
classification of each URL along with any additional relevant information. The output can be
further analyzed or integrated into a larger cybersecurity system designed to protect users from
accessing malicious websites. Overall, this process represents a systematic approach to
identifying potential phishing attempts among URLs, leveraging machine learning models and
database technologies to enhance cybersecurity measures.
Steps:
Determine the legitimacy of URLs based on predictions stored in MongoDB.
Present output in a dictionary format indicating URL classification and relevant
information.
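The output step can be sketched as a small function that converts a stored prediction into the dictionary presented to the user. The key names (classification, details) and the "unknown" fallback for URLs without a stored prediction are assumptions made for this illustration.

```python
def format_output(url, stored_prediction):
    """Turn a stored prediction (or None) into the dictionary shown to the user."""
    if stored_prediction is None:
        return {"url": url, "classification": "unknown", "details": "no prediction stored"}
    return {
        "url": url,
        "classification": stored_prediction["prediction"],
        "details": f"predicted by {stored_prediction.get('model', 'unspecified model')}",
    }

# Hypothetical stored document, in the shape a database lookup might return.
result = format_output("https://www.google.com/", {"prediction": "legitimate", "model": "cnn_rf"})
missing = format_output("http://example.test/", None)
print(result)
print(missing)
```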
CHAPTER 5
IMPLEMENTATION
5.1 List of Modules
• FeatureExtractor.py - Extracts features.
• Model.py - Implements machine learning models.
• Scrapeit.py - Handles web scraping tasks.
• Frontend.py - Manages the frontend interface.
• Mongodb.utils.py - Provides utilities for MongoDB.
• EmailFeatures.py - Extracts features from emails.
• Tokenizer.py - Tokenizes text data.
• EmailModel.py - Implements machine learning models for email analysis.
5.2 Dataset
Fig:5.2.1 Dataset for URL Phishing Classification
The dataset shown in Figure 5.2.1 is for URL analysis for cybersecurity purposes.
• URL Features: The dataset includes features like URL length, presence
of IP addresses or ‘@’ symbols, and the number of subdomains. These
can indicate whether a URL might be malicious.
• Domain Information: Attributes such as domain age and domain end
(expiration) provide insights into the legitimacy of a website.
• Content Analysis: Title and content size could be used to detect spam or
phishing attempts.
• Interaction Flags: Features like ‘Mouse_Over’ and ‘Right_Click’ may be
used to identify suspicious scripts or web elements designed to deceive users.
• Security Indicators: The presence of HTTPS and ‘Tiny_URL’ can signal
secure or potentially obfuscated links.
• Behavioral Metrics: ‘Num_Third_Party_Clicks’ and ‘Num_Popups’ might
help assess user interaction and potential exposure to threats.
• Final Evaluation: ‘Final_Val’ and ‘Result’ likely represent the outcome
of an analysis, determining if a URL is safe or dangerous.
This dataset is crucial for developing models that predict URL safety, helping
protect users from online threats. Each feature contributes to a comprehensive
evaluation of web security risks.
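A few of the lexical URL features described above can be computed with the standard library alone, as in the sketch below. It is not the project's FeatureExtractor.py; the function names, the dictionary keys, and the list of shortener hosts are assumptions for illustration. The sample URL is Test Case 1 from Section 6.2.

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative shortener list (an assumption, not taken from the report).
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "da.gd", "t.co"}

def host_is_ip(host):
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def extract_url_features(url):
    """Compute a few lexical features of the kind listed for Figure 5.2.1."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    is_ip = host_is_ip(host)
    labels = [] if is_ip or not host else host.split(".")
    return {
        "url_length": len(url),
        "has_ip": is_ip,
        "has_at": "@" in url,
        # Naive count: labels beyond the last two (ignores suffixes like co.uk).
        "num_subdomains": max(len(labels) - 2, 0),
        "uses_https": parts.scheme == "https",
        "tiny_url": host in SHORTENER_HOSTS,
    }

features = extract_url_features("https://twitter.com-free-coupons@da.gd/3B0a9")
print(features)
```

The example shows why the '@' symbol matters: everything before it is treated as user information, so the real host of this URL is da.gd and not twitter.com.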
Fig:5.2.2 Dataset for Spam Mail Detection
The email dataset shown in Figure 5.2.2 is a collection of labeled emails of the
kind used for spam filtering, sentiment analysis, or email classification. The
dataset has two main columns:
• Label: This column categorizes each email. Labels can be binary (e.g.,
"spam" or "not spam") or have multiple categories (e.g., "work", "personal",
"promotion").
• Text: This column contains the actual content of the email, including the
body and potentially the subject line.
5.3 Module Description
1. FeatureExtractor.py- This module serves as the backbone of data
preprocessing, extracting a diverse array of features from URLs and email
content. It meticulously dissects URLs, capturing crucial attributes like
domain name, length, and redirection count, while also delving into email
bodies to discern sentiment, common spam keywords, and urgency cues. These
extracted features lay the groundwork for robust classification and analysis,
empowering subsequent modules with rich, context-aware data for effective threat
detection.
Fig:5.3.1 Code Snippet of FeatureExtractor.py
Fig:5.3.2 Working of FeatureExtractor
2. Model.py- Within Model.py resides the arsenal of machine learning
algorithms meticulously crafted for the task of cyber threat detection. From
Random Forest and Support Vector Machines to the sophistication of
Convolutional Neural Networks (CNN) and Recurrent Neural Networks
(RNN), this module encapsulates a diverse range of classifiers. Trained on
extensive datasets and fine-tuned using various neural network architectures,
these models exhibit superior performance in distinguishing between
legitimate and malicious entities, safeguarding users against phishing and
spam attacks.
Fig:5.3.3 Code Snippet of Model.py
Fig:5.3.4 Model Working Diagram
3. Scrapeit.py- Scrapeit.py embodies the proactive approach to cybersecurity,
employing the Scrapy library to traverse the digital landscape and scrutinize
website content. As the web crawler of choice, it meticulously inspects web
pages, extracting valuable insights that inform URL classification and
analysis. By dissecting the content of entire websites, it augments the system's
understanding of contextual nuances, enhancing the accuracy of threat
assessment and bolstering defenses against phishing attempts.
Fig:5.3.5 Code Snippet of Scrapeit.py
4. Frontend.py- As the gateway to user interaction, Frontend.py orchestrates a
seamless and intuitive interface powered by the Streamlit library. Through its
elegant design and user-friendly features, it empowers users to navigate the
complexities of cybersecurity effortlessly. From choosing tasks to interpreting
results, this module streamlines the user experience, providing actionable
insights and fostering informed decision-making in the face of potential
threats.
Fig:5.3.6 Code Snippet of Frontend.py
Fig:5.3.7 Working of Frontend
5. Mongodb.utils.py- Mongodb.utils.py acts as the conduit between the system
and the robust data storage capabilities of MongoDB. Equipped with utility
functions, it facilitates efficient data handling, enabling seamless storage and
retrieval of critical information generated by the system. By leveraging
MongoDB's scalability and flexibility, it ensures the integrity and accessibility
of data, laying the foundation for iterative refinement and enhancement of the
system's capabilities.
Fig:5.3.8 Code Snippet of MongodbUtils.py
6. EmailFeatures.py- EmailFeatures.py delves into the intricacies of email
communication, extracting a myriad of features essential for spam detection.
From analyzing attachment types and sender information to scrutinizing email
headers for anomalies, this module leaves no stone unturned in its quest to
identify potential threats. By leveraging sophisticated feature extraction
techniques, it equips the system with the contextual understanding necessary
to differentiate between legitimate and malicious emails, fortifying defenses
against deceptive cyber tactics.
Fig:5.3.9 Code Snippet of EmailFeatures.py
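The kind of email features described for this module can be sketched with the standard library as follows. This is not the project's EmailFeatures.py; the keyword list, the set of risky attachment extensions, the sender pattern, and the feature names are all assumptions for illustration. The sample input is Test Case 6 from Section 6.2.

```python
import re

# Illustrative lists (assumptions, not taken from the report).
SPAM_KEYWORDS = {"free", "won't believe", "claim", "limited time", "click here"}
RISKY_EXTENSIONS = {".exe", ".scr", ".bat", ".js"}

# Simplified sender check: local part, '@', then dot-separated domain labels
# made of letters, digits, and hyphens (no spaces or underscores).
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

def extract_email_features(sender, body, attachment=None):
    """Compute simple spam indicators from sender, body, and attachment name."""
    lowered = body.lower()
    letters = [c for c in body if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    extension = ""
    if attachment and "." in attachment:
        extension = "." + attachment.rsplit(".", 1)[1].lower()
    return {
        "valid_sender": bool(EMAIL_PATTERN.match(sender)),
        "keyword_hits": sum(1 for k in SPAM_KEYWORDS if k in lowered),
        "exclamation_count": body.count("!"),
        "upper_ratio": round(upper_ratio, 2),
        "risky_attachment": extension in RISKY_EXTENSIONS,
    }

f6 = extract_email_features(
    "f tilbud@free-money-now.com",
    "You WON'T BELIEVE this offer! Click here to claim your FREE iPhone!!! Limited time only!!!",
    "iphone_gift_certificate.exe",
)
print(f6)
```

Features of this kind would then be passed to a classifier; the sketch itself makes no spam decision.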
7. Tokenizer.py- At the heart of natural language processing (NLP) lies
Tokenizer.py, a module dedicated to breaking down textual data into its
constituent parts. Through tokenization, it dissects email content and URL
attributes, transforming raw text into structured data suitable for analysis. By
segmenting text into meaningful units, it lays the groundwork for sentiment
analysis, keyword extraction, and other NLP-based techniques, enriching the
system's understanding of linguistic nuances and enhancing its threat detection
capabilities.
Fig:5.3.10 Code Snippet of Tokenizer.py
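Tokenization as described for this module can be illustrated with a short regular-expression sketch. It is a simplified assumption, not the project's Tokenizer.py: text is lowercased and split into runs of letters, digits, and apostrophes.

```python
import re
from collections import Counter

# Word-like tokens: runs of lowercase letters, digits, and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")

def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("Click here to claim your FREE iPhone!!! Limited time only!!!")
token_counts = Counter(tokens)
print(tokens)
```

The resulting token counts are the kind of structured input that keyword extraction and a classifier such as Naïve Bayes can work with.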
8. EmailModel.py- EmailModel.py embodies the culmination of email-specific
threat detection, housing machine learning models trained on a diverse array
of email features. From sender analysis and header scrutiny to content
examination and attachment inspection, this module employs a holistic
approach to spam detection. Trained on meticulously labeled datasets and fine-
tuned using neural network architectures, these models exhibit superior
accuracy in discerning between legitimate emails and spam, safeguarding
users against deceptive cyber threats with unparalleled efficacy.
Fig:5.3.11 Code Snippet of EmailModel.py
CHAPTER 6
TESTING AND VALIDATION
6.1 Testing Methods
1. Black Box Testing: Assess the phishing detection system without knowing
its internal structure or code. This simulates how a real user interacts with the
website.
2. White Box Testing: Examine the internal structure, logic, and code of the
website to ensure all aspects of the phishing detection algorithm are
functioning correctly.
3. Functional Testing: Verify that all functions of the website (e.g., URL
analysis, content scanning) work as intended to detect and prevent phishing
attacks.
4. Non-functional Testing: Evaluate performance (e.g., response time),
reliability, and security (e.g., data encryption) of the phishing detection
system.
5. Usability Testing: Assess the user interface for ease of understanding and
interaction, ensuring users can effectively utilize the phishing detection
features.
6. Compatibility Testing: Ensure the website functions correctly across
different browsers, operating systems, and devices to reach the widest
audience.
7. Security Testing: Check for vulnerabilities and weaknesses in the website's
security measures to prevent exploitation by attackers.
8. Scalability Testing: Verify the website's ability to handle increasing
amounts of data and traffic as the user base grows.
6.2 Test Cases
Test Case 1:
URL: https://twitter.com-free-coupons@da.gd/3B0a9
Expected Output: Phished
Obtained Output: Unknown
Test Case 2:
URL: http://www.merry-thought.com/media/.26-2020
Expected Output: Phished
Obtained Output: Legitimate
Test Case 3:
URL: https://www.google.com/
Expected Output: Legitimate
Obtained Output: Legitimate
Test Case 4:
URL: https://659615825bf7f7ac661dedc98939fea2.serveo.net
Expected Output: Phished
Obtained Output: Phished
Test Case 5:
URL: https://supportappleld.com.secureupdate.duilawyeryork.com/ap/89e6a3b4b063b8d/?cmd=_update&dispatch=89e6a3b4b063b8d1b&locale=_
Expected Output: Phished
Obtained Output: Phished
Test Case 6:
Email Address: "f tilbud@free-money-now.com"
Body: "You WON'T BELIEVE this offer! Click here to claim your FREE
iPhone!!! Limited time only!!!"
Attachment: iphone_gift_certificate.exe
Expected Output: Spam
Obtained Output: Spam
Test Case 7:
Email Address: "colleague@yourcompany.com"
Body: "Hi [Your Name], Just following up on the meeting yesterday. Attached
are the notes for your reference. Let me know if you have any questions. Best,
[Colleague Name]"
Attachment: "Meeting_Notes_2024-05-07.docx"
Expected Output: Not Spam
Obtained Output: Not Spam
Test Case 8:
Email Address: "noreply@seemingly_legitimate_company.com"
Body: "Dear Valued Customer, We'd like to inform you about exciting new
updates to our services! Click here to learn more and ensure you don't miss
out. Sincerely, The [Company Name] Team"
Attachment: None
Expected Output: Spam
Obtained Output: Not Spam
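The eight test cases above can be tallied with a short script that compares each expected output with the obtained output. The tuples are copied from the test cases in this section; the variable names are chosen only for this sketch.

```python
# (test case number, expected output, obtained output), as listed in Section 6.2.
test_cases = [
    (1, "Phished", "Unknown"),
    (2, "Phished", "Legitimate"),
    (3, "Legitimate", "Legitimate"),
    (4, "Phished", "Phished"),
    (5, "Phished", "Phished"),
    (6, "Spam", "Spam"),
    (7, "Not Spam", "Not Spam"),
    (8, "Spam", "Not Spam"),
]

passed = [n for n, expected, obtained in test_cases if expected == obtained]
failed = [n for n, expected, obtained in test_cases if expected != obtained]
pass_rate = len(passed) / len(test_cases)

print(f"passed: {passed}, failed: {failed}, pass rate: {pass_rate:.3f}")
```

Five of the eight cases match their expected output; Test Cases 1, 2, and 8 do not.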
CHAPTER 7
RESULTS AND DISCUSSION
7.1 Model Performance
Table 7.1.1 Accuracy of Different Trained Models
As depicted in Table 7.1.1, various models were trained on our dataset,
with the CNN and RNN model combined with a stacked random forest yielding
the highest accuracy. This model configuration was selected for further analysis
and deployment based on its superior performance.
Fig:7.1.1 Training of the Model
Fig:7.1.2 Confusion Matrix of the Trained Model
Upon further training with an expanded dataset, Figure 7.1.1 illustrates the accuracy
achieved in the final few epochs, alongside the resulting final accuracy and mean
absolute error attained. These metrics provide insights into the model's performance
after additional training iterations with a larger dataset. In Figure 7.1.2, the confusion
matrix is presented to provide a detailed breakdown of the model's classification
performance. The confusion matrix offers valuable insights into the distribution of true
positive, true negative, false positive, and false negative predictions, aiding in the
assessment of the model's overall classification efficacy.
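The four cells of a confusion matrix, and the metrics usually derived from them, can be computed as in the sketch below. The labels are toy values for illustration only; they are not the report's results and do not reproduce Figure 7.1.2.

```python
def confusion_counts(y_true, y_pred, positive="phished"):
    """Count true/false positives and negatives for a binary task."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(t == positive and p == positive for t, p in pairs),
        "tn": sum(t != positive and p != positive for t, p in pairs),
        "fp": sum(t != positive and p == positive for t, p in pairs),
        "fn": sum(t == positive and p != positive for t, p in pairs),
    }

def summarize(counts):
    """Derive accuracy, precision, and recall from the four counts."""
    total = sum(counts.values())
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total,
        "precision": counts["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": counts["tp"] / actual_pos if actual_pos else 0.0,
    }

# Toy labels for illustration only; they are not the report's results.
y_true = ["phished", "phished", "legitimate", "legitimate", "phished", "legitimate"]
y_pred = ["phished", "legitimate", "legitimate", "phished", "phished", "legitimate"]

cm = confusion_counts(y_true, y_pred)
metrics = summarize(cm)
print(cm, metrics)
```

For a phishing detector, false negatives (phishing URLs classified as legitimate) are the costly cell, which is why recall is worth reporting alongside accuracy.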
Fig:7.1.3 Training and Validation Loss of Model
Fig:7.1.4 Training and Validation Accuracy of Model
The graphs depicted in Figures 7.1.3 and 7.1.4 illustrate the loss and accuracy trends
over epochs, along with the disparity between validation and training accuracy. These
visualizations offer a dynamic depiction of the model's learning progress and
highlight any potential overfitting or underfitting issues by showcasing the variance
between accuracy metrics on training and validation datasets across different epochs.
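One simple way to read such curves is to track the per-epoch gap between training and validation accuracy, as in the sketch below. The accuracy values and the 0.05 threshold are hypothetical; they are not read from Figures 7.1.3 or 7.1.4.

```python
def generalization_gaps(train_acc, val_acc):
    """Per-epoch difference between training and validation accuracy."""
    return [round(t - v, 4) for t, v in zip(train_acc, val_acc)]

def first_overfit_epoch(train_acc, val_acc, threshold=0.05):
    """Return the first 1-based epoch whose gap exceeds the threshold, or None."""
    for epoch, gap in enumerate(generalization_gaps(train_acc, val_acc), start=1):
        if gap > threshold:
            return epoch
    return None

# Hypothetical accuracy curves, not values taken from the report's figures.
train_acc = [0.80, 0.88, 0.93, 0.97, 0.99]
val_acc = [0.79, 0.86, 0.89, 0.90, 0.90]

overfit_epoch = first_overfit_epoch(train_acc, val_acc)
print(generalization_gaps(train_acc, val_acc), overfit_epoch)
```

A widening gap of this kind suggests overfitting, while low accuracy on both curves would instead point to underfitting.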
Fig 7.1.5 Saved URL with feature in dataset
Figure 7.1.5 displays the features collected from the URL provided in the frontend
through site scraping. This visualization presents a detailed breakdown of the
extracted features, providing insights into the characteristics of the analyzed URL.
Fig 7.1.6 Frontend execution
Figure 7.1.6 illustrates the user-input URL and the corresponding output, indicating
whether it is classified as phished or deemed safe, thereby aiding the user in making
informed decisions.
Fig 7.1.7 Spam Email Detection
Figure 7.1.7 visually represents the user inputting the email address, body of the
email, and any attachments, with the system providing detection feedback on whether
the email is classified as spam or legitimate.
7.2 Summary
The choice of a Convolutional Neural Network (CNN) with Random Forest for
URL detection in CyberShield AI was justified by its performance in classifying
phishing URLs, where it achieved the highest accuracy among the models compared
in Table 7.1.1. CNNs are best known for image recognition tasks, where they
identify local patterns within images. In the context of URL detection, a URL can
be treated as a one-dimensional sequence of characters, analogous to a row of
pixels, and a CNN can learn the character patterns within such sequences. The URL
phishing detector module employs a feature extractor to analyze various
characteristics of URLs, such as their length, domain name, redirection count, and
subdomain. These features provide crucial insights into whether a URL is legitimate
or phishing. The utilization of CNNs in this module allows for the effective
extraction and analysis of these features from URL sequences. CNNs excel in
capturing hierarchical patterns in data, which is crucial in identifying the subtle
characteristics that differentiate phishing URLs from legitimate ones.
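Before a CNN can process a URL as a character sequence, the characters have to be mapped to integers and padded to a fixed length. The sketch below shows one common way to do this; the vocabulary (printable ASCII), the reserved indices, and the maximum length are assumptions, since the report does not describe the encoding it used.

```python
import string

# Hypothetical vocabulary: printable ASCII characters, with index 0 reserved
# for padding and index 1 for characters outside the vocabulary.
PAD, UNKNOWN = 0, 1
VOCAB = {ch: idx + 2 for idx, ch in enumerate(string.printable)}

def encode_url(url, max_len=64):
    """Map each character to an integer id, then truncate or pad to max_len."""
    ids = [VOCAB.get(ch, UNKNOWN) for ch in url[:max_len]]
    return ids + [PAD] * (max_len - len(ids))

encoded_url = encode_url("https://www.google.com/", max_len=32)
print(encoded_url)
```

A fixed-length integer sequence like this is the usual input to an embedding layer followed by one-dimensional convolutions.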
Moreover, the decision to combine CNNs with Random Forest is strategic. Random
Forest is an ensemble learning method known for its robustness and ability to handle
high-dimensional data. By stacking CNNs with Random Forest, the model benefits
from the strengths of both approaches. CNNs capture intricate patterns in URL
sequences, while Random Forest effectively aggregates the predictions from
multiple decision trees, leading to improved accuracy and generalization. During the
model selection process, various classifiers, including Random Forest, Decision
Trees, SVM, and Naïve Bayes, were trained and tested using different deep learning
architectures such as CNN, RNN, ANN, RCNN, and FNN. The performance of
each combination was evaluated using a labeled dataset of 5000 instances. Through
rigorous testing, the CNN with Random Forest model consistently demonstrated the
highest accuracy over the validation set.
CHAPTER 8
CONCLUSION AND SCOPE FOR FUTURE
ENHANCEMENTS
8.1 Conclusion
In conclusion, CyberShield AI represents a significant advancement in
cybersecurity, harnessing machine learning and deep learning to combat phishing
and spam attacks. The URL phishing detector module employs advanced feature
extraction and classification techniques, achieving high accuracy in distinguishing
between legitimate and malicious URLs. Complementarily, the spam email detector
module utilizes NLP for feature extraction and analysis, effectively identifying
spam emails through meticulous scrutiny of content and attachments.
Integral to the project's success is its seamless integration with MongoDB, ensuring
efficient data storage and retrieval. The user-friendly frontend interface empowers
users to navigate tasks effortlessly, providing actionable insights based on risk
assessment. CyberShield AI exemplifies the potential of AI in safeguarding digital ecosystems
against evolving threats. With its robust defense mechanisms and user-centric
design, it stands as a beacon of innovation in cybersecurity.
8.2 Future Work
In the future, CyberShield AI could be further strengthened by implementing real-
time threat detection capabilities to swiftly identify and mitigate emerging cyber
threats. Enhanced feature extraction techniques, such as semantic analysis and deep
learning-based feature embeddings, can provide deeper insights into the context and
intent behind suspicious URLs and email content. Integration with external threat
intelligence feeds can enrich the system's knowledge base, while user education and
awareness initiatives can empower users to recognize and respond to cyber threats
proactively. Additionally, optimizing scalability and performance through
distributed computing and cloud-based infrastructure will ensure the system can
handle growing data volumes and user demand without compromising performance.
APPENDIX - I
PROJECT CONTRIBUTION
Type of the Project Which of the following aspects are covered in this project?
Application / Product New Technology Safety Ethics Cost Society
Cyber Shield AI ✔ X ✔ ✔ ✔
REFERENCES
[1] M. Aljabri et al., "Detecting Malicious URLs Using Machine Learning Techniques:
Review and Research Directions," in IEEE Access, vol. 10, pp. 121395-121417, 2022, doi:
10.1109/ACCESS.2022.3222307.
[2] A. Chawla, “Phishing website analysis and detection using Machine Learning”, Int J Intell
Syst Appl Eng, vol. 10, no. 1, pp. 10–16, Mar. 2022.
[3] Z. Azam, M. M. Islam and M. N. Huda, "Comparative Analysis of Intrusion Detection
Systems and Machine Learning-Based Model Analysis Through Decision Tree," in IEEE
Access, vol. 11, pp. 80348-80391, 2023, doi: 10.1109/ACCESS.2023.3296444.
[4] Shouq Alnemari and Majid Alshammari, "Detecting Phishing Domains Using Machine
Learning" in Appl. Sci. 2023, 13(8), 4649, doi: 10.3390/app13084649
[5] M. Almousa and M. Anwar, "A URL-Based Social Semantic Attacks Detection With
Character-Aware Language Model," in IEEE Access, vol. 11, pp. 10654-10663, 2023, doi:
10.1109/ACCESS.2023.3241121.
[6] A.A. Orunsolu, A.S. Sodiya, A.T. Akinwale, A predictive model for phishing detection,
Journal of King Saud University - Computer and Information Sciences, Volume 34, Issue 2,
2022, Pages 232-247, ISSN 1319-1578
[7] M. N. Ashtiani and B. Raahemi, "Intelligent Fraud Detection in Financial Statements
Using Machine Learning and Data Mining: A Systematic Literature Review," in IEEE
Access, vol. 10, pp. 72504-72525, 2022, doi: 10.1109/ACCESS.2021.3096799.
[8] Mughaid, A., AlZu’bi, S., Hnaif, A. et al. An intelligent cyber security phishing detection
system using deep learning techniques. Cluster Comput 25, 3819–3828 (2022).
[9] P. D. P. L. Indrasiri, Malka N. Halgamuge and Azeem Mohammad, "Robust Ensemble
Machine Learning Model for Filtering Phishing URLs: Expandable Random Gradient
Stacked Voting Classifier (ERG-SVC)," in IEEE Access, vol. 9, pp. 150142-150161, 2021,
doi: 10.1109/ACCESS.2021.3124628.
[10] Sandeep Rangineni and Divya Marupaka, "Analysis of Data Engineering for Fraud
Detection Using Machine Learning and Artificial Intelligence Technologies," in International
Research Journal of Modernization in Engineering Technology and Science (IRJMETS),
e-ISSN 2582-5208, doi: 10.56726/IRJMETS43408.
[11] Tanimu, Jibrilla & Shiaeles, Stavros. (2022). Phishing Detection Using Machine
Learning Algorithm. 317-322. 10.1109/CSR54599.2022.9850316.
[12] L. Tang and Q. H. Mahmoud, "A Deep Learning-Based Framework for Phishing
Website Detection," in IEEE Access, vol. 10, pp. 1509-1521, 2022, doi:
10.1109/ACCESS.2021.3137636.
[13] K. Mridha, J. Hasan, S. D and A. Ghosh, "Phishing URL Classification Analysis Using
ANN Algorithm," 2021 IEEE 4th International Conference on Computing, Power and
Communication Technologies (GUCON), Kuala Lumpur, Malaysia, 2021, pp. 1-7, doi:
10.1109/GUCON50781.2021.9573797.
[14] Vahid Shahrivari, Mohammad Mahdi Darabi and Mohammad Izadi, "Phishing
Detection Using Machine Learning Techniques," CoRR, vol. abs/2009.11116, 2020, doi:
10.48550/arXiv.2009.11116.
[15] Mughaid, A., AlZu’bi, S., Hnaif, A. et al. An intelligent cyber security phishing
detection system using deep learning techniques. Cluster Comput 25, 3819–3828 (2022).
https://doi.org/10.1007/s10586-022-03604-4
[16] Sahingoz, Ozgur & Buber, Ebubekir & Demir, Onder & Diri, Banu. (2019). Machine
learning based phishing detection from URLs. Expert Systems with Applications. 117. 345-
357.
[17] Vahid Shahrivari, Mohammad Mahdi Darabi and Mohammad Izadi, "Phishing
Detection Using Machine Learning Techniques," CoRR, vol. abs/2009.11116, 2020, doi:
10.48550/arXiv.2009.11116.
[18] D, Naresh. (2020). Detection of Phishing Websites using an Efficient Machine Learning
Framework. International Journal of Engineering Research and. V9.
10.17577/IJERTV9IS050888.
[19] I. Saha, D. Sarma, R. J. Chakma, M. N. Alam, A. Sultana and S. Hossain, "Phishing
Attacks Detection using Deep Learning Approach," 2020 Third International Conference on
Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 2020, pp. 1180-1185,
doi: 10.1109/ICSSIT48917.2020.9214132.
[20] M. N. Alam, D. Sarma, F. F. Lima, I. Saha, R. -E. -. Ulfath and S. Hossain, "Phishing
Attacks Detection using Machine Learning Approach," 2020 Third International Conference
on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 2020, pp. 1173-
1179, doi: 10.1109/ICSSIT48917.2020.9214225.
[21] S. Anwar et al., "Countering Malicious URLs in Internet of Things Using a Knowledge-
Based Approach and a Simulated Expert," in IEEE Internet of Things Journal, vol. 7, no. 5,
pp. 4497-4504, May 2020, doi: 10.1109/JIOT.2019.2954919.
[22] Almseidin, Mohammad & Abuzuraiq, Almaha & Alkasassbeh, Mouhammd & Alnidami,
Nidal. (2019). Phishing Detection Based on Machine Learning and Feature Selection Methods.
International Journal of Interactive Mobile Technologies (iJIM). 13. 171.
10.3991/ijim.v13i12.11411.
[22] N. Z. Harun, N. Jaffar, and P. S. J. Kassim, "Physical attributes significant in preserving
the social sustainability of the traditional malay settlement," in Reframing the Vernacular:
Politics, Semiotics, and Representation. Springer, 2023, pp. 225–238.
[23] D. M. Divakaran and A. Oest, "Phishing detection leveraging machine learning and deep
learning: A review," 2022, arXiv:2205.07411.
[24] A. Akanchha, "Exploring a robust machine learning classifier for detecting phishing
domains using SSL certificates," Fac. Comput. Sci., Dalhousie Univ., Halifax, NS, Canada,
Tech. Rep. 10222/78875, 2022.
[25] H. Shahriar and S. Nimmagadda, "Network intrusion detection for TCP/IP packets with
machine learning techniques," in Machine Intelligence and Big Data Analytics for
Cybersecurity Applications. Cham, Switzerland: Springer, 2020, pp. 231–247.
[26] J. Kline, E. Oakes, and P. Barford, "A URL-based analysis of WWW structure and
dynamics," in Proc. Netw. Traffic Meas. Anal. Conf. (TMA), Jun. 2019, p. 800.
[27] A. K. Murthy and Suresha, "XML URL classification based on their semantic structure
orientation for web mining applications," Proc. Comput. Sci., vol. 46, pp. 143–150,
Jan. 2021.
[28] A. A. Ubing, S. Kamilia, A. Abdullah, N. Jhanjhi, and M. Supramaniam, "Phishing
website detection: An improved accuracy through feature selection and ensemble
learning," Int. J. Adv. Comput. Sci. Appl., vol. 10, no. 1, pp. 252–257, 2021.
[29] A. Aggarwal, A. Rajadesingan, and P. Kumaraguru, "PhishAri: Automatic realtime
phishing detection on Twitter," in Proc. eCrime Res. Summit, Oct. 2021, pp. 1–12.
[30] S. N. Foley, D. Gollmann, and E. Snekkenes, Computer Security—ESORICS 2017, vol.
10492. Oslo, Norway: Springer, Sep. 2020.
[31] P. George and P. Vinod, "Composite email features for spam identification," in Cyber
Security. Singapore: Springer, 2018, pp. 281–289.
[32] H. S. Hota, A. K. Shrivas, and R. Hota, "An ensemble model for detecting phishing
attack with proposed remove-replace feature selection technique," Proc. Comput. Sci.,
vol. 132, pp. 900–907, Jan. 2019.
[33] G. Sonowal and K. S. Kuppusamy, "PhiDMA—A phishing detection model with
multi-filter approach," J. King Saud Univ., Comput. Inf. Sci., vol. 32, no. 1, pp. 99–112, Jan. 2020.
[34] M. Zouina and B. Outtaj, "A novel lightweight URL phishing detection system using
SVM and similarity index," Hum.-Centric Comput. Inf. Sci., vol. 7, no. 1, p. 17, Jun. 2022.
[35] R. Ø. Skotnes, "Management commitment and awareness creation—ICT safety and
security in electric power supply network companies," Inf. Comput. Secur., vol. 23, no. 3, pp.
302–316, Jul. 2023.
[36] R. Prasad and V. Rohokale, "Cyber threats and attack overview," in Cyber Security:
The Lifeline of Information and Communication Technology. Cham, Switzerland: Springer,
2020, pp. 15–31.
[37] T. Nathezhtha, D. Sangeetha, and V. Vaidehi, "WC-PAD: Web crawling based phishing
attack detection," in Proc. Int. Carnahan Conf. Secur. Technol. (ICCST), Oct. 2019, pp. 1–6.
[38] R. Jenni and S. Shankar, "Review of various methods for phishing detection," EAI
Endorsed Trans. Energy Web, vol. 5, no. 20, Sep. 2022, Art. no. 155746.