Survey of Machine Learning in Phishing Detection Research
1 Introduction
The Internet has become an integral part of daily life, to the point that a world without it is
hard to envision. As of January 2021, the global digital population report counted 4.66 billion
active Internet users worldwide, or 59.5% of the global population. Notably, 92.6% of these users
access the Internet through smartphones [1]. This widespread connectivity has changed how people
exchange information, shop, communicate, and work. The onset of the COVID-19 pandemic
prompted a significant shift from traditional offline services to online platforms, particularly in
industries like catering and retail. In this digitally immersed era, individuals frequently share
sensitive data online, ranging from usernames and passwords to personal information and credit
card details. Unfortunately, cybercriminals exploit various illicit methods to acquire such
information, subsequently engaging in unauthorized activities on the Internet. Network security
concerns have been present since the early days of the Internet's inception, evolving in tandem
with its development. The rapid evolution of network attack techniques poses significant
challenges to cybersecurity. Noteworthy categories of cybersecurity issues, classified based on
attack methods and forms, include denial-of-service attacks (DoS), man-in-the-middle attacks
(MitM), SQL injection, zero-day exploits, DNS tunneling, phishing, and malware. The dynamic
landscape of the Internet and its vulnerabilities necessitate ongoing efforts to enhance
cybersecurity measures and protect users from potential threats.
Phishing stands as a sophisticated network attack that skillfully blends social engineering tactics
with computer technology to illicitly acquire sensitive personal information from unsuspecting
users. This method involves cyber attackers enticing individuals to click on fraudulent links
through deceptive emails, SMS messages, or social media communications. With a history
spanning over 30 years, phishing continues to be a pervasive threat, leading to substantial
economic losses as a considerable number of users fall victim each year. The year 2020 witnessed
a substantial surge in phishing attacks, exacerbating the prevalence of this cyber threat [2]. The
global COVID-19 pandemic prompted many countries to implement financial assistance programs.
Anti-phishing strategies encompass a dual approach, involving both user education and technical
defense. This paper predominantly focuses on reviewing the technical defense methodologies that
have emerged in recent years, with a particular emphasis on the identification of phishing websites,
the pivotal step in tricking users into disclosing their information. Numerous academic research efforts
and commercial products have been introduced to detect phishing websites. Traditional methods
often adopt list-based solutions, where valid and legitimate websites are compiled into whitelists,
and confirmed phishing sites are added to blacklists. These lists are then widely disseminated to
prevent other users from falling victim to attacks. This approach proves effective in preventing the
reuse of known phishing website URLs, thereby minimizing the number of affected users and
potential losses. Notably, these methods are commonly employed in real-time defensive actions
due to their low computational time cost, facilitated by single-string match algorithms. However,
a significant drawback of list-based solutions lies in their inability to promptly detect new phishing
URLs. This limitation exposes innocent users to potential attacks before the malicious link is added
to a blacklist. In response, some researchers have proposed rule-based methods designed to
identify novel fraudulent websites. This methodology involves leveraging the expertise of security
professionals and conducting in-depth analyses of phishing sites. Adhering to the W3C standard
for URL components, which include the protocol, subdomain, domain name, port, path, query,
parameters, and fragment, rules are generated. For instance, rules may be established based on the
similarity of the domain name to other legitimate domains. Some rules may necessitate third-party
services to obtain additional information, such as the registration date of the domain. However, it
is crucial to acknowledge that the dissemination of rules through technical articles poses a risk.
Cybercriminals can study these rules and adapt, creating new phishing URLs that deviate from the
published rules.
With the continuous evolution of machine learning techniques, a plethora of methodologies rooted
in machine learning has surfaced for the purpose of recognizing phishing websites, aiming to
enhance prediction performance. Phishing detection, as a supervised classification approach,
leverages labeled datasets to train models that can effectively classify data. Various algorithms are
employed in this supervised learning process, including but not limited to naïve Bayes, neural
networks, linear regression, logistic regression, decision trees, support vector machines, K-nearest
neighbor, and random forests. Practical products need robust solutions that meet two key
requirements. First, they need high accuracy coupled with a low false warning rate; achieving this
often hinges on the availability of substantial datasets, particularly for complex structures like
neural networks. Second, computational time must be low enough for real-time detection systems.
This paper primarily aims to survey effective methods for preventing phishing attacks in real-time
environments. It delineates the basic life cycle of a phishing attack, starting from the entry point,
with a specific focus on the phase when a user clicks on a phishing link. The paper reviews
technical methods that identify the phishing link and promptly alert the user. In addition to
conventional approaches such as blacklist matching and recognition methods, this paper provides
an in-depth exploration of machine learning-based URL detection technology. By presenting state-
of-the-art solutions, the paper engages in a comparative analysis, highlighting the challenges and
limitations inherent in each solution. Furthermore, it offers insights into potential research
directions and future solutions. The main contributions of this paper lie in its comprehensive
review of effective anti-phishing methods, its analytical comparison of various solutions, and its
provision of valuable ideas for future research in this domain.
2. A survey of major datasets and data sources for phishing detection websites;
This paper is structured as follows. In Section 2, we delve into the background and related work
concerning phishing. Section 3 provides an exhaustive exploration of methodologies for detecting
website phishing, categorizing them into list-based methods, heuristic strategies, and machine
learning-based solutions. Specifically, we elucidate the intricate details of the general architecture
underlying phishing network detection solutions based on machine learning. Section 4 introduces
various frameworks employed in website phishing detection systems. Moving forward, Section 5
outlines cutting-edge machine learning-based solutions, organized into three distinct categories
based on the number and characteristics of the learning model. Section 6 is dedicated to a thorough
discussion of the challenges associated with detecting phishing attacks. Finally, in Section 7, we
draw conclusions from the insights gathered in this study. Each section contributes to the
overarching goal of presenting a comprehensive overview of effective methods, frameworks, and
state-of-the-art solutions in the realm of phishing detection, while also addressing the complexities
and obstacles inherent in this evolving field.
2 Literature Review
Phishing stands as a prevalent cyberattack tactic wherein perpetrators send deceptive emails or
messages with the aim of tricking recipients into visiting fraudulent pages. The ultimate goal is to
illicitly gather sensitive user data, including usernames, passwords, and credit card numbers, for
financial exploitation. Figure 1 illustrates the typical phishing life cycle. In the initial stage, an
attacker creates a phishing website designed to closely resemble a legitimate counterpart. The
manipulation of the URL is a common strategy employed by attackers, involving tactics like
spelling errors, the use of similar alphabetic characters, and other methods to mimic the URL of
the authentic website. Notably, attackers often focus on imitating the domain name and network
resource directory. For instance, a deceptive link like "https://aimazon.amz-
z7acyuup9z0y16.xyz/v" (accessed on 9 January 2024) attempts to emulate the legitimate
https://www.amazon.com. While a desktop browser reveals the actual URL when the user hovers
over the link, such distinctions can be challenging for the average user to discern visually or from
memory, making them susceptible to these imitative tactics. Beyond manipulating URLs, attackers
also imitate web content. They use scripts to extract logos, web
layouts, and text directly from authentic web pages. Cybercriminals commonly replicate form
submission pages that require users to input sensitive information. Examples include forged
login pages, payment pages, and password recovery pages, all designed to dupe users into
divulging confidential data. This multifaceted approach underscores the sophistication and
deceptive nature of phishing attacks, exploiting both technical and psychological vulnerabilities of
unsuspecting users.
The second phase of the phishing life cycle involves sending emails that strongly encourage
recipients to click on the provided link. Phishing links are not limited to emails but can also be
disseminated through SMS, voice messages, QR codes, and even spoofed mobile applications [4].
With the prevalent use of smartphones and social media, criminals have expanded their channels
for spreading false information. In these methods, text and images are commonly employed to
deceive individuals into clicking on the deceptive link. For instance, an attacker might impersonate
the customer service of a telecommunications company, sending emails that urge users to make
payments to prevent service disruptions. While scam emails are typically sent indiscriminately,
there is always a subset of users with lower defensive awareness who may fall victim to such
tactics. In this step, attackers leverage social engineering methods, including psychological
manipulation, to induce users into making security mistakes. Perpetrators are adept at instilling a
sense of fear and urgency while gaining the user's trust through text messages. Subsequently, users
click on the link, which redirects them to a fraudulent website. Notably, on mobile phones the
real URL string is concealed until the link is opened in the web browser.
The subsequent step involves the collection of personal information on the fake website, designed
to closely resemble the legitimate web page of a company or organization. This mimicry includes
using a similar logo, name, user interface design, and content. This deceptive tactic is frequently
employed in processes like login, password reset, payment, and renewal of personal information.
When users submit sensitive information on web servers established by attackers, criminals gain
access to all the data provided.
The final step entails stealing the user's account funds by utilizing the user's genuine information
to fabricate requests on a legitimate website. Some individuals use the same usernames and
passwords across multiple websites, allowing attackers to compromise multiple accounts
belonging to the user. In some instances, phishers leverage stolen data for further criminal
activities. Since phishing techniques were first documented in a 1987 paper [5], they have
evolved with the development of the Internet. For instance, with the rise of online
payment systems, attackers have shifted their focus to online payment phishing. According to the
2020 Internet Crime Report, the Internet Crime Complaint Center (IC3) received 791,790
cyberattack complaints, with phishing scams accounting for around 30% of these, making them
the most reported type of cybercrime and causing over USD 54 million in losses [2]. Consequently,
for individuals navigating the Internet, the ability to distinguish between genuine and fraudulent
web pages is imperative. Visual tools are essential to aid users in identifying phishing websites.
2.1 Anti-Phishing
As depicted in Figure 1, the phishing life cycle involves five distinct steps leading up to an
attacker stealing money from the user's account or utilizing the gathered information for other
malicious activities. Thwarting any one of these steps can effectively halt a phishing attack, which
is why prevention and mitigation matter at every stage. Here, we examine anti-phishing
preventive measures for each stage of the phishing life cycle. By addressing
vulnerabilities and implementing countermeasures at each phase, organizations and individuals
can enhance their resilience against these deceptive cyber threats.
attackers. One effective method involves obfuscation, which includes the use of CSS sprites to
display critical data. CSS sprites involve combining multiple images into a single image file and
then using CSS to display specific portions of that image as needed. This technique makes it more
challenging for automated scripts to decipher and extract relevant information, as the data is not
presented in a straightforward manner. Another strategy is to replace text with images. By
converting textual content into image files, legitimate websites introduce an additional layer of
complexity for scraping tools. Automated scripts find it harder to interpret and extract meaningful
data from images compared to plain text. This approach adds an extra barrier for attackers seeking
to collect information from web pages. By implementing these obfuscation techniques, legitimate
websites can introduce hurdles and increase the effort required for attackers to scrape content.
While not foolproof, these measures contribute to raising the costs and challenges associated with
creating fraudulent web pages through automated means.
Facial Movement Recognition: This involves analyzing facial features and movements in real-
time to authenticate the user.
Expression Recognition: This method assesses the user's facial expressions during the login
process to verify their identity.
Voiceprint Recognition: The user's unique vocal characteristics are analyzed and compared to a
stored voiceprint for authentication.
Implementing these dynamic and biometric verification methods adds an extra layer of security by
requiring users to prove their authenticity through characteristics that are difficult for attackers to
replicate or manipulate. This multi-factor authentication approach enhances the overall security
posture of online platforms and helps protect users from unauthorized access and fraudulent
activities.
In a review by Singh et al. on machine learning-based phishing detection [9], the authors provide
a concise history of phishing and major attack reports. Phishing attacks are categorized into social
engineering attacks and malware-based phishing. Features are classified into three categories—
source code features, URL features, and image features—all based on rules.
In 2020, Vijayalakshmi et al. presented a survey on major detection techniques and a taxonomy
for detecting phishing [10]. Utilizing a statistical report from APWG, the paper outlines the trends
of phishing attacks from 2017 to 2019. The survey introduces a taxonomy of automated phishing
detection solutions, classifying them into three categories based on input parameters: web address-
based methods, webpage content-based solutions, and hybrid approaches. Web address-based
approaches are further divided into list-based, heuristic rule-based, and learning-based methods,
while web content-based solutions are separated into rule-based and machine learning-based
solutions. The authors list most state-of-the-art methodologies for each category, providing
detailed interpretations of each solution. Through a comparison of methods using various
evaluation metrics, including classification performance, limitations, third-party service
independence, and zero-hour attack detection, the authors suggest that hybrid approaches can
achieve high accuracy rates and are suitable for real-time systems. Additionally, they anticipate
that deep learning-based solutions will play a crucial role in the future.
Kalaharsha and Mehtre conducted a survey on phishing detection solutions, categorizing them
based on the applied techniques and input parameters [11]. The paper introduces various types of
phishing attacks and three phishing techniques. The authors outline 18 methods and 9 datasets for
detecting phishing websites, conducting a comparative analysis of accuracy performance across
all models. The paper also addresses challenges in the field, including the reduction of false-
positive rates and mitigating overfitting.
In a more recent survey, Jain and Gupta provide a comprehensive analysis of phishing attack
techniques, detection methods, and existing challenges [12]. They incorporate statistical reports
and motivations behind phishing attacks, presenting diverse phishing attack techniques targeting
both PCs and smartphones. The authors introduce various defense methods and compare existing
anti-phishing approaches published between 2006 and 2017, highlighting their advantages and
limitations. The survey also outlines major challenges, such as the selection of efficient features,
identification of tiny URLs, and the detection of phishing attacks on smartphones.
Subsequently, the legitimate domain is compared with the domain of the target website
to determine its phishing status [15].
In a different approach, Chiew et al. utilized a website's logo image as a distinguishing feature to
verify its legitimacy [16]. The authors employed machine learning algorithms to extract logos from
webpage images and then conducted a domain query using the logo as a keyword in the Google
search engine. This category is often referred to as a search engine-based approach due to its
reliance on search engines for verification.
In a 2012 paper, Mohammad et al. introduced an automatic technique for feature extraction from
phishing websites, collecting 2500 phishing URLs from the PhishTank archive [18]. They
extracted 17 features, categorized into address bar-based, abnormal-based, and HTML and
JavaScript-based features. Most features were automatically extracted from the URL and the web
page's source code without relying on third-party services. The age of the domain and the DNS
record, however, were obtained from the WHOIS database [19], and the web page rank was sourced
from the Alexa database [20]. The authors proposed an IF-ELSE rule and assigned weights to each
feature based on its contribution to phishing link classification.
In 2015, Mohammad et al. published a phishing website dataset on the UCI Machine Learning
Repository containing 11,055 instances with 30 features [22]; it has become a foundational
resource for machine learning-based phishing detection solutions. Choon also contributed a phishing
dataset on Mendeley in 2018, containing 10,000 data rows with 48 features extracted from
PhishTank and OpenPhish for phishing webpages. Published datasets are generally small
compared with those used in other machine learning applications. Therefore, resampling techniques
such as N-fold cross-validation are often employed: the data is split into N pieces and the procedure
is iterated N times, with each iteration selecting one piece as testing data and the others as
training data.
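The N-fold procedure can be sketched in a few lines. This is a minimal illustration, not taken from any surveyed paper; the helper name `k_fold_indices`, the shuffling seed, and the sample counts are assumptions chosen for the example.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for N-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        # All remaining folds together form the training set for this iteration.
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


# Hypothetical sizes: 10 samples split into 5 folds.
splits = list(k_fold_indices(n_samples=10, n_folds=5))
```

Each sample appears in the testing piece exactly once across the N iterations, which is what lets a small dataset serve as both training and testing data.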
Alternatively, some researchers opt to collect URLs directly from the Internet, obtaining them
from sources like PhishTank, OpenPhish, and Spamhaus.org for phishing URLs, and
dmoztools.net, Alexa, and Common Crawl for legitimate websites. They then parse the features
themselves, contributing to a diverse array of data sources in the field.
With the successful development of natural language processing (NLP) techniques, many
researchers now capture character-level features from URL strings using NLP methods and
feed them into deep learning models to enhance accuracy. This approach presents significant
advantages, including independence from cybersecurity expertise and the avoidance of reliance on
third-party network services [24].
Given that characters in URLs are continuous and lack semantic distinctions, character-level
features, such as character-level TF-IDF features, are utilized. TF-IDF, which stands for Term
Frequency–Inverse Document Frequency, is applied here at the character level, treating each
character as a term. The algorithm calculates the TF-IDF score for each character, generating a
matrix reflecting the relevance of each character in the URL string. Taking "https://www.google.com/"
(accessed on 18 July 2021) as an example, the string comprises 23 characters ("h", "t", "t", "p", "s", ":",
"/", "/", "w", "w", "w", ".", "g", "o", "o", "g", "l", "e", ".", "c", "o", "m", "/"), of which 14 are
distinct. Treating each distinct character as a term, it generates a vector with 14 TF-IDF scores.
The TF-IDF score for a single character is the product of its term frequency within the URL and
its inverse document frequency across the corpus.
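As a rough illustration of character-level TF-IDF, the sketch below assumes one common formulation: term frequency is the character count divided by the URL length, and inverse document frequency is log(N / df), where N is the number of URLs and df is the number of URLs containing the character. Smoothed variants exist, and the surveyed papers may use a different one. The function name `char_tfidf` and the second URL are hypothetical.

```python
import math
from collections import Counter


def char_tfidf(urls):
    """Return one {character: TF-IDF score} dict per URL.

    Assumed formulation: tf = count / len(url), idf = log(N / df).
    """
    n_docs = len(urls)
    # Document frequency: in how many URLs does each character appear?
    df = Counter(ch for url in urls for ch in set(url))
    vectors = []
    for url in urls:
        counts = Counter(url)
        vectors.append({
            ch: (count / len(url)) * math.log(n_docs / df[ch])
            for ch, count in counts.items()
        })
    return vectors


corpus = [
    "https://www.google.com/",
    "http://example.test/login",  # hypothetical URL, for illustration only
]
vectors = char_tfidf(corpus)
```

Characters that occur in every URL of the corpus (such as "h" or "/" here) receive a score of zero under this formulation, while characters specific to one URL receive a positive score.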
Zamir et al. employed various techniques, including recursive feature elimination (RFE),
information gain (IG), and relief ranking, to eliminate redundant features in phishing detection.
Additionally, they introduced principal components analysis (PCA) for attribute analysis. IG
serves as an indicator, calculating the importance of features based on class probability, feature
probability, and class probability under a feature condition. RFE is a widely used algorithm for
feature reduction, eliminating the least essential features until the error rate meets expectations.
The relief ranking filter, another technique used by Shabudin et al., calculates feature value scores
by comparing the values of two adjacent data points using the nearest neighbor search algorithm.
It then sorts these scores to obtain feature value weights. After this process, they obtained 22
features ranked by weight and removed eight redundant features with zero scores.
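The information gain computation described above can be sketched for a discrete feature as follows. This is the standard entropy-based definition, not code from Zamir et al.; the feature names `has_ip_in_url` and `uses_https` and the four-row toy data are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(class) - sum over v of P(feature = v) * H(class | feature = v)."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: 1 = phishing, 0 = legitimate (hypothetical values).
labels = [1, 1, 0, 0]
has_ip_in_url = [1, 1, 0, 0]  # separates the classes perfectly
uses_https = [1, 0, 1, 0]     # carries no information about the class

ig_ip = information_gain(has_ip_in_url, labels)
ig_https = information_gain(uses_https, labels)
```

A feature whose gain is zero, like the second one here, is a candidate for removal as redundant.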
Zabihimayvan and Doran applied Fuzzy Rough Set (FRS) theory to select crucial features from
the UCI dataset and Mendeley dataset for phishing detection. FRS is an extension of Rough Set
(RS) theory, which determines a decision boundary by assessing the equality of data points based
on certain features and the same classes. FRS is particularly suitable for datasets where features
are discrete, as in the original UCI dataset. After normalization, where features are converted to
continuous values between 0 and 1, FRS is applied.
El-Rashidy proposed a novel technique for feature selection in web phishing detection models in
2021. This method consists of two phases. The first phase calculates the absence impact of each
feature by training a random forest model with a new dataset that removes one feature at a time.
The second phase involves training and testing the model, starting with one feature and
progressively adding features from a ranked list. The feature subset with the highest accuracy is
selected. This method is effective for selecting the most impactful feature subset, but its
computational complexity and time consumption make it more suitable for small feature sizes and
single classifiers.
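The two phases can be sketched as below. This is an interpretation of the description above, not El-Rashidy's implementation: the `score` callable stands in for "train and test a random forest on this feature subset", and the toy scoring function and feature names are hypothetical.

```python
def select_features(features, score):
    """Two-phase selection; `score(subset)` returns accuracy for a feature subset."""
    baseline = score(list(features))
    # Phase 1: absence impact = accuracy lost when a single feature is removed.
    impact = {f: baseline - score([g for g in features if g != f]) for f in features}
    ranked = sorted(features, key=lambda f: impact[f], reverse=True)
    # Phase 2: grow the subset along the ranking and keep the best-scoring prefix.
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked) + 1):
        candidate = ranked[:k]
        current = score(candidate)
        if current > best_score:
            best_subset, best_score = candidate, current
    return best_subset, best_score


# Toy stand-in for model training and evaluation (purely illustrative).
USEFUL = {"url_length": 0.20, "has_ip": 0.15}


def toy_score(subset):
    gain = sum(USEFUL.get(f, 0.0) for f in subset)
    penalty = 0.01 * sum(1 for f in subset if f not in USEFUL)
    return 0.5 + gain - penalty


subset, accuracy = select_features(["favicon", "url_length", "has_ip", "popup"], toy_score)
```

The sketch calls `score` once per feature in each phase, so with a real classifier the cost grows with both the feature count and the training time, which matches the complexity concern noted above.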
3.3.3 Modeling
Machine learning models can be categorized into three types: single classifiers, hybrid models,
and deep learning approaches. Hybrid models integrate multiple algorithms during the training
process. Detecting phishing websites is typically framed as a binary classification problem. Here
are some commonly used classification algorithms:
Support Vector Machine (SVM): This supervised learning algorithm separates data points into
two classes and predicts the class of each new data point. It is suitable for linear binary
classification, using a hyperplane in the N-dimensional space spanned by the N features. The
primary idea is to maximize the margin, that is, the distance between the separating hyperplane
and the nearest data points. For instance, when training an SVM on the 30-feature UCI dataset in a
phishing vs. legitimate scenario, a 29-dimensional hyperplane is employed.
Decision Tree: This popular algorithm has a tree structure where each internal node represents a
feature, branches represent feature values, and each leaf node provides the result. Simpler trees
generally generalize better, whereas deep trees may lead to overfitting.
Random Forest: An ensemble of decision trees used for classification and regression. It addresses
overfitting by combining individual tree outputs (majority vote or averaging), resulting in generally
higher accuracy compared to standalone decision trees.
continuous data and Hamming distance for discrete values are used. Notably, k-NN lacks a formal
training process, and each prediction may be time-consuming, making it less suitable for real-time
scenarios.
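The per-prediction cost is visible in a direct implementation. The sketch below uses Hamming distance on discrete feature vectors; the toy vectors, labels, and the choice of k = 3 are assumptions for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # No training step: every prediction scans the whole training set.
    nearest = sorted(range(len(train_x)), key=lambda i: hamming(train_x[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical binary features for five labeled URLs.
train_x = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1), (0, 0, 0)]
train_y = ["phishing", "phishing", "phishing", "legitimate", "legitimate"]

prediction = knn_predict(train_x, train_y, query=(1, 1, 0), k=3)
```

For continuous features, `hamming` would be swapped for a Euclidean distance function; the rest of the procedure stays the same.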
Naive Bayes: Naive Bayes is a probabilistic algorithm grounded in Bayes' theorem and
known for its strong (naive) independence assumption: features are treated as conditionally
independent given the class. Bayes' theorem relates conditional probabilities, and the
independence assumption keeps the classifier simple.
The rapid evolution of deep learning and the success of natural language processing (NLP) have
led to the development of various deep learning models for phishing detection. These models
extract information and sequential patterns from URL strings without relying on source code
features from webpage content. They do not require specialized cybersecurity knowledge or
third-party services to capture characteristics.
RNN (Recurrent Neural Network): RNN is a deep neural network equipped with an internal
memory function to handle sequences of varying lengths, making it particularly effective in text-
related tasks like text mining.
True Positives (TP): The count of positive data points correctly identified by the classifier.
True Negatives (TN): The count of negative data points correctly identified by the classifier.
False Positives (FP): The count of negative data points erroneously labeled as positive by the
classifier.
False Negatives (FN): The count of positive data points mistakenly labeled as negative by the
model.
These metrics, encapsulated in Table 3, provide a comprehensive insight into how well the
classifier performs in accurately identifying positive and negative instances within the testing
dataset.
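The four counts can be computed directly from true and predicted labels. The sketch below also derives accuracy, precision, and recall using their standard definitions; whether these match the metrics listed in Table 3 is an assumption, and the label vectors are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


# Hypothetical labels: 1 = phishing (positive), 0 = legitimate (negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # undefined if the classifier predicts no positives
recall = tp / (tp + fn)     # undefined if the data contains no positives
```

In phishing detection, a false positive blocks a legitimate site and a false negative lets a phishing site through, so precision and recall are usually read together rather than relying on accuracy alone.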