

Across the Spectrum: A Comprehensive Survey of Machine Learning


in Phishing Detection Research

1 Introduction
The Internet has become an integral aspect of people's daily lives, playing a crucial role that is
hard to envision a world without. As of January 2021, the global digital population report indicates
a staggering 4.66 billion active Internet users worldwide, constituting 59.5% of the global
population. Notably, 92.6% of these users access the Internet through smartphones [1]. This
widespread connectivity has revolutionized various facets of life, including information exchange,
online shopping, communication, and professional tasks. The onset of the COVID-19 pandemic in 2019 prompted a significant shift from traditional offline services to online platforms, particularly in industries such as catering and retail. In this digitally immersed era, individuals frequently share
sensitive data online, ranging from usernames and passwords to personal information and credit
card details. Unfortunately, cybercriminals exploit various illicit methods to acquire such
information, subsequently engaging in unauthorized activities on the Internet. Network security
concerns have been present since the early days of the Internet's inception, evolving in tandem
with its development. The rapid evolution of network attack techniques poses significant
challenges to cybersecurity. Noteworthy categories of cybersecurity issues, classified based on
attack methods and forms, include denial-of-service attacks (DoS), man-in-the-middle attacks
(MitM), SQL injection, zero-day exploits, DNS tunneling, phishing, and malware. The dynamic
landscape of the Internet and its vulnerabilities necessitate ongoing efforts to enhance
cybersecurity measures and protect users from potential threats.

Phishing stands as a sophisticated network attack that skillfully blends social engineering tactics
with computer technology to illicitly acquire sensitive personal information from unsuspecting
users. This method involves cyber attackers enticing individuals to click on fraudulent links
through deceptive emails, SMS messages, or social media communications. With a history
spanning over 30 years, phishing continues to be a pervasive threat, leading to substantial
economic losses as a considerable number of users fall victim each year. The year 2020 witnessed
a substantial surge in phishing attacks, exacerbating the prevalence of this cyber threat [2]. The
global COVID-19 pandemic prompted many countries to implement financial assistance programs
through government departments. Seizing this opportunity, cybercriminals utilized phishing
techniques to deceitfully gather sensitive personal information, enabling them to fraudulently
apply for government subsidies like unemployment benefits. Disturbingly, among the cyber-attack
complaints registered by the U.S. public in 2020, phishing-related complaints constituted the
highest proportion [2]. Further emphasizing the gravity of the situation, the APWG phishing
activity trends report for 2020 indicated a nearly twofold increase in the number of phishing attacks
throughout the year [3]. The pervasive and evolving nature of phishing highlights the need for
heightened awareness, robust cybersecurity measures, and ongoing efforts to protect individuals
and organizations from falling prey to these deceptive practices.

Anti-phishing strategies encompass a dual approach, involving both user education and technical
defense. This paper predominantly focuses on reviewing the technical defense methodologies that
have emerged in recent years, with a particular emphasis on the identification of phishing websites,
a pivotal step in the process of deceiving user information. Numerous academic research endeavors
and commercial products have been introduced to detect phishing websites. Traditional methods
often adopt list-based solutions, where valid and legitimate websites are compiled into whitelists,
and confirmed phishing sites are added to blacklists. These lists are then widely disseminated to
prevent other users from falling victim to attacks. This approach proves effective in preventing the
reuse of known phishing website URLs, thereby minimizing the number of affected users and
potential losses. Notably, these methods are commonly employed in real-time defensive actions
due to their low computational time cost, facilitated by single-string match algorithms. However,
a significant drawback of list-based solutions lies in their inability to promptly detect new phishing
URLs. This limitation exposes innocent users to potential attacks before the malicious link is added
to a blacklist. In response, some researchers have proposed rule-based methods designed to
identify novel fraudulent websites. This methodology involves leveraging the expertise of security
professionals and conducting in-depth analyses of phishing sites. Adhering to the W3C standard
for URL components, which include the protocol, subdomain, domain name, port, path, query,
parameters, and fragment, rules are generated. For instance, rules may be established based on the
similarity of the domain name to other legitimate domains. Some rules may necessitate third-party
services to obtain additional information, such as the registration date of the domain. However, it
is crucial to acknowledge that the dissemination of rules through technical articles poses a risk.
Cybercriminals can study these rules and adapt, creating new phishing URLs that deviate from the
established patterns. Consequently, cybersecurity specialists have responded by developing more
sophisticated rules, some of which are derived from the source codes of web pages to enhance the
accuracy of phishing website detection. The ongoing evolution of technical defense strategies
remains essential in staying ahead of the dynamic landscape of phishing threats.
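To make the rule-based approach concrete, the sketch below splits a URL into the W3C-style components listed above and applies a few illustrative rules. The component split, the thresholds, and the `known_brands` list are our own simplified assumptions rather than rules from any cited system; in particular, production detectors would consult the Public Suffix List to find the registered domain.

```python
from urllib.parse import urlparse

def extract_url_parts(url):
    """Split a URL into W3C-style components (protocol, subdomain, domain, ...)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".")
    # Simplification: treat the last two labels as the registered domain.
    # Real systems consult the Public Suffix List instead.
    domain = ".".join(labels[-2:]) if len(labels) >= 2 else host
    return {
        "protocol": parsed.scheme,
        "subdomain": ".".join(labels[:-2]),
        "domain": domain,
        "port": parsed.port,
        "path": parsed.path,
        "query": parsed.query,
        "fragment": parsed.fragment,
    }

def simple_rules(url, known_brands=("amazon", "paypal")):
    """Return the names of the illustrative heuristic rules that fire for a URL."""
    parts = extract_url_parts(url)
    fired = []
    if parts["protocol"] != "https":
        fired.append("no_https")            # page not served over TLS
    if parts["domain"].count("-") >= 2:
        fired.append("many_hyphens")        # hyphen-heavy registered domain
    for brand in known_brands:
        if brand in parts["subdomain"] and brand not in parts["domain"]:
            fired.append("brand_in_subdomain")  # brand name outside its real domain
    return fired
```

Rules of this kind are cheap to evaluate, but, as noted above, attackers who learn them can simply register URLs that avoid the published patterns.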

With the continuous evolution of machine learning techniques, a plethora of methodologies rooted
in machine learning has surfaced for the purpose of recognizing phishing websites, aiming to
enhance prediction performance. Phishing detection, as a supervised classification approach,
leverages labeled datasets to train models that can effectively classify data. Various algorithms are
employed in this supervised learning process, including but not limited to naïve Bayes, neural
networks, linear regression, logistic regression, decision trees, support vector machines, K-nearest
neighbor, and random forests. The development of practical products necessitates robust solutions
that must meet two key requirements. Firstly, there is a need for high accuracy coupled with a low
false warning rate. Achieving optimal model performance often hinges on the availability of
substantial datasets, particularly for complex structures like neural networks. Moreover, the
computational time factor is critical for the viability of real-time detection systems.
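As an illustration of this supervised setup, the following sketch trains a logistic regression classifier with plain-Python batch gradient descent on a handful of hand-crafted URL features. The feature set, the hyperparameters, and any training data fed to it are our own illustrative choices, not those of any surveyed system.

```python
import math

def url_features(url):
    """A few numeric features of the kind used in the literature (illustrative)."""
    return [
        len(url) / 100.0,                        # normalized URL length
        url.count("-") / 5.0,                    # hyphen count
        url.count(".") / 5.0,                    # dot count
        1.0 if "@" in url else 0.0,              # '@' can hide the real host
        1.0 if url.startswith("https") else 0.0, # served over TLS
    ]

def train_logistic(xs, ys, lr=0.5, epochs=500):
    """Batch gradient descent on the logistic (cross-entropy) loss."""
    n, m = len(xs[0]), len(xs)
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * n, 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(n):
                gw[i] += (p - y) * x[i]
            gb += p - y
        w = [wi - lr * gi / m for wi, gi in zip(w, gw)]
        b -= lr * gb / m
    return w, b

def phishing_probability(w, b, url):
    """Score an unseen URL; values above 0.5 suggest phishing."""
    z = sum(wi * xi for wi, xi in zip(w, url_features(url))) + b
    return 1.0 / (1.0 + math.exp(-z))
```

Trained on labeled URLs, `phishing_probability` then scores unseen links. Real systems use far richer features and much larger datasets, which is exactly the data requirement noted above.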

This paper primarily aims to survey effective methods for preventing phishing attacks in real-time
environments. It delineates the basic life cycle of a phishing attack, starting from the entry point,
with a specific focus on the phase when a user clicks on a phishing link. The paper employs
technical methods to identify the phishing link and promptly alert the user. In addition to
conventional approaches such as blacklist matching and recognition methods, this paper provides
an in-depth exploration of machine learning-based URL detection technology. By presenting state-of-the-art solutions, the paper engages in a comparative analysis, highlighting the challenges and
limitations inherent in each solution. Furthermore, it offers insights into potential research
directions and future solutions. The main contributions of this paper are its comprehensive review of effective anti-phishing methods, its analytical comparison of various solutions, and its provision of ideas for future research in this domain. In particular, it presents:

1. A phishing life cycle to clearly capture the phishing problem;

2. A survey of major datasets and data sources for phishing detection websites;

3. A state-of-the-art survey of machine learning-based solutions for detecting phishing websites.


This paper is structured as follows. In Section 2, we delve into the background and related work
concerning phishing. Section 3 provides an exhaustive exploration of methodologies for detecting
website phishing, categorizing them into list-based methods, heuristic strategies, and machine
learning-based solutions. Specifically, we elucidate the intricate details of the general architecture
underlying phishing network detection solutions based on machine learning. Section 4 introduces
various frameworks employed in website phishing detection systems. Moving forward, Section 5
outlines cutting-edge machine learning-based solutions, organized into three distinct categories
based on the number and characteristics of the learning model. Section 6 is dedicated to a thorough
discussion of the challenges associated with detecting phishing attacks. Finally, in Section 7, we
draw conclusions from the insights gathered in this study. Each section contributes to the
overarching goal of presenting a comprehensive overview of effective methods, frameworks, and
state-of-the-art solutions in the realm of phishing detection, while also addressing the complexities
and obstacles inherent in this evolving field.


2 Literature Review
Phishing stands as a prevalent cyberattack tactic wherein perpetrators send deceptive emails or
messages with the aim of tricking recipients into visiting fraudulent pages. The ultimate goal is to
illicitly gather sensitive user data, including usernames, passwords, and credit card numbers, for
financial exploitation. Figure 1 illustrates the typical phishing life cycle. In the initial stage, an
attacker creates a phishing website designed to closely resemble a legitimate counterpart. The
manipulation of the URL is a common strategy employed by attackers, involving tactics like
spelling errors, the use of similar alphabetic characters, and other methods to mimic the URL of
the authentic website. Notably, attackers often focus on imitating the domain name and network
resource directory. For instance, a deceptive link like "https://aimazon.amz-z7acyuup9z0y16.xyz/v" (accessed on 9 January 2024) attempts to emulate the legitimate
https://www.amazon.com. While a computer's browser may reveal the actual URL by hovering
over the link, such distinctions can be challenging for the average user to discern visually or from
memory, making them susceptible to these imitative tactics. Beyond manipulating URLs, attackers
also emphasize the imitation of web content. This involves the use of scripts to extract logos, web
layouts, and text directly from authentic web pages. Cybercriminals commonly replicate form
submission pages that necessitate users to input sensitive information. Examples include forged
login pages, payment pages, and password recovery pages, all designed to dupe users into
divulging confidential data. This multifaceted approach underscores the sophistication and
deceptive nature of phishing attacks, exploiting both technical and psychological vulnerabilities of
unsuspecting users.
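A simple way to operationalize the "similar alphabetic characters" observation is string-similarity matching against a list of protected brand domains. The sketch below uses Python's standard difflib; the domain list and the 0.75 threshold are illustrative assumptions, not values from any cited work.

```python
import difflib

# Hypothetical list of protected brand domains (illustrative, not exhaustive).
LEGITIMATE_DOMAINS = ("amazon.com", "paypal.com", "google.com")

def flag_lookalike(domain, threshold=0.75):
    """Return (brand, score) if `domain` closely resembles, but is not,
    a known legitimate domain; otherwise return None."""
    for legit in LEGITIMATE_DOMAINS:
        score = difflib.SequenceMatcher(None, domain, legit).ratio()
        if domain != legit and score >= threshold:
            return legit, score
    return None
```

For example, `flag_lookalike("aimazon.com")` matches amazon.com with high similarity, while an unrelated domain returns None. Production systems usually combine such string similarity with homoglyph normalization (e.g., mapping Cyrillic "а" to Latin "a").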


Figure 1: Phishing life cycle.

The second phase of the phishing life cycle involves sending emails that strongly encourage
recipients to click on the provided link. Phishing links are not limited to emails but can also be
disseminated through SMS, voice messages, QR codes, and even spoofed mobile applications [4].
With the prevalent use of smartphones and social media, criminals have expanded their channels
for spreading false information. In these methods, text and images are commonly employed to
deceive individuals into clicking on the deceptive link. For instance, an attacker might impersonate
the customer service of a telecommunications company, sending emails that urge users to make
payments to prevent service disruptions. While scam emails are typically sent indiscriminately,
there is always a subset of users with lower defensive awareness who may fall victim to such
tactics. In this step, attackers leverage social engineering methods, including psychological
manipulation, to induce users into making security mistakes. Perpetrators are adept at instilling a
sense of fear and urgency while gaining the user's trust through text messages. Subsequently, users
click on the link, which redirects them to the fraudulent website. Notably, on mobile phones the full URL string is often hidden from view before the page opens in the browser.

The subsequent step involves the collection of personal information on the fake website, designed
to closely resemble the legitimate web page of a company or organization. This mimicry includes
using a similar logo, name, user interface design, and content. This deceptive tactic is frequently
employed in processes like login, password reset, payment, and renewal of personal information.
When users submit sensitive information on web servers established by attackers, criminals gain
access to all the data provided.

The final step entails stealing the user's account funds by utilizing the user's genuine information
to fabricate requests on a legitimate website. Some individuals use the same usernames and
passwords across multiple websites, allowing attackers to compromise multiple accounts
belonging to the user. In some instances, phishers leverage stolen data for further criminal
activities. Since phishing techniques were first documented in 1987 [5], these methods have evolved with the development of the Internet. For instance, with the rise of online
payment systems, attackers have shifted their focus to online payment phishing. According to the
2020 Internet Crime Report, the Internet Crime Complaint Center (IC3) received 791,790
cyberattack complaints, with phishing scams accounting for around 30% of these, making them
the most reported type of cybercrime and causing over USD 54 million in losses [2]. Consequently,
for individuals navigating the Internet, the ability to distinguish between genuine and fraudulent
web pages is imperative. Visual tools are essential to aid users in identifying phishing websites.

2.1 Anti-Phishing
Indeed, as depicted in Figure 1, the phishing life cycle involves five distinct steps leading up to an
attacker stealing money from the user's account or utilizing the gathered information for other
malicious activities. Recognizing the potential impact at each stage underscores the significance
of preventing and mitigating phishing attacks. Consequently, thwarting any one of these steps can
effectively halt a phishing attack. Here, we examine anti-phishing methods, considering preventive measures at each stage of the phishing life cycle. By addressing
vulnerabilities and implementing countermeasures at each phase, organizations and individuals
can enhance their resilience against these deceptive cyber threats.

2.2 Web Scraping


While it may be challenging to entirely prevent perpetrators from creating deceptive web pages,
certain techniques can significantly increase the cost and complexity for attackers. One common
tactic employed by attackers involves the use of scripts to write crawlers, which automatically
extract content from legitimate web pages. Subsequently, the attackers intercept valuable
information and replicate it on phishing web pages. In response, legitimate websites can employ
various techniques to hinder web scraping, thereby impeding the automated extraction of data by
attackers. One effective method involves obfuscation, which includes the use of CSS sprites to
display critical data. CSS sprites involve combining multiple images into a single image file and
then using CSS to display specific portions of that image as needed. This technique makes it more
challenging for automated scripts to decipher and extract relevant information, as the data is not
presented in a straightforward manner. Another strategy is to replace text with images. By
converting textual content into image files, legitimate websites introduce an additional layer of
complexity for scraping tools. Automated scripts find it harder to interpret and extract meaningful
data from images compared to plain text. This approach adds an extra barrier for attackers seeking
to collect information from web pages. By implementing these obfuscation techniques, legitimate
websites can introduce hurdles and increase the effort required for attackers to scrape content.
While not foolproof, these measures contribute to raising the costs and challenges associated with
creating fraudulent web pages through automated means.

2.2.1 Spam Filter


Spam filtering techniques play a crucial role in identifying and mitigating unsolicited emails before
users have the chance to read or click on any embedded links. Many popular email services,
including Gmail, Yahoo, Outlook, and AOL, have incorporated spam filtering components as an
integral part of their platforms. These filters aim to distinguish between legitimate emails and
spam, enhancing user experience and security. Earlier spam filters relied on relatively
straightforward methods, such as blacklists or whitelists and empirical rules. However, with the
advancements in artificial intelligence technology, modern spam filters have evolved to integrate
intelligent prediction models based on machine learning. This incorporation of machine learning
allows filters to analyze and identify patterns associated with spam, even if specific instances are
not explicitly listed in predefined blacklists. For instance, Gmail utilizes a machine learning-based
spam filter that can effectively block around 100 million additional spam emails on a daily basis.
This dynamic approach enables the filter to adapt and improve its effectiveness over time, learning
from patterns and characteristics associated with spam emails. The integration of machine learning
enhances the ability of spam filters to detect and prevent new and evolving spam tactics, providing
users with a more robust defense against unwanted and potentially harmful email content.
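Gmail's actual models are proprietary, but the classic machine-learning baseline for spam filtering is multinomial naive Bayes. The sketch below implements it from scratch with Laplace (add-one) smoothing; the whitespace tokenizer and any toy training data are our own simplifications.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_spam_filter(messages, labels):
    """Count word frequencies per class (0 = ham, 1 = spam)."""
    counts = {0: Counter(), 1: Counter()}
    priors = Counter(labels)
    for text, y in zip(messages, labels):
        counts[y].update(tokenize(text))
    vocab = set(counts[0]) | set(counts[1])
    return counts, priors, vocab

def is_spam(model, text):
    """Compare log-posteriors under each class, with add-one smoothing."""
    counts, priors, vocab = model
    total = sum(priors.values())
    scores = {}
    for y in (0, 1):
        score = math.log(priors[y] / total)
        denom = sum(counts[y].values()) + len(vocab)
        for tok in tokenize(text):
            score += math.log((counts[y][tok] + 1) / denom)
        scores[y] = score
    return scores[1] > scores[0]
```

Because the model scores word patterns rather than matching fixed lists, it can flag messages whose exact sender or wording has never been seen before, which is the adaptivity described above.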


2.2.2 Detecting Fake Websites


When users unknowingly land on a phishing web page designed to mimic a legitimate website, it
can be challenging for many people to recall the authentic domain name, particularly in the case
of lesser-known start-ups or companies. As a result, users may struggle to identify phishing
websites solely based on the URL. To address this issue, some web browsers have integrated
security components designed to detect phishing or malware sites, providing users with warnings
when attempting to access potentially unsafe web pages. For example, Google Chrome
incorporates a security feature that displays warning messages when users visit websites deemed
unsafe. Google's initiative, Google Safe Browsing, launched in 2007, is an integral part of various
Google products, including Gmail and Google Search. This security component relies on a
blacklist containing URLs associated with malware or phishing attempts [7]. While web browsers
and extensions may offer protection based on blacklists or whitelists, these solutions are less
effective when dealing with previously unknown phishing websites. Fortunately, the rapid
advancements in artificial intelligence technology have introduced new ideas and solutions for
detecting phishing attacks. Predictive models based on machine learning have demonstrated the capability to identify phishing links that are not present in any blacklist and that circumvent existing rules. This innovation enhances the ability to detect and combat phishing attacks in real-time, offering a more dynamic and effective defense against evolving cyber threats.
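Conceptually, a blacklist or whitelist check is just an exact-match lookup on the URL's host. The sketch below uses in-memory sets with hypothetical entries; the real Google Safe Browsing service instead queries hashed URL prefixes through an API, but the lookup semantics are the same.

```python
from urllib.parse import urlparse

# Hypothetical list contents, for illustration only.
PHISHING_BLACKLIST = {"aimazon.amz-z7acyuup9z0y16.xyz", "secure-login-verify.example.net"}
TRUSTED_WHITELIST = {"www.amazon.com", "www.google.com"}

def check_url(url):
    """Cheap exact-match verdict, mirroring how list-based defenses stay fast."""
    host = urlparse(url).hostname or ""
    if host in TRUSTED_WHITELIST:
        return "safe"
    if host in PHISHING_BLACKLIST:
        return "phishing"
    return "unknown"  # brand-new phishing sites land here until reported
```

The `unknown` branch is the weakness noted above: a freshly registered phishing URL passes both lookups, which is the gap machine learning-based models aim to close.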

2.2.3 Second Authorization Verification


Once an attacker successfully acquires the user's sensitive data, the subsequent step typically
involves utilizing this information to gain unauthorized access to the legitimate website,
manipulate the user's account, and carry out fraudulent activities, such as stealing funds. In
response to this threat, it becomes imperative for websites to implement additional steps to verify
the authenticity of the user, particularly when there are disparities in the IP address and device
information used during the login attempt compared to the user's typical usage patterns. To
enhance security, many websites incorporate extra verification measures that go beyond traditional
username and password authentication. These additional verifications often leverage dynamic and
biometric factors, which are more challenging for attackers to manipulate. Examples of such
measures include:


Facial Movement Recognition: This involves analyzing facial features and movements in real-
time to authenticate the user.

Expression Recognition: This method assesses the user's facial expressions during the login
process to verify their identity.

Voiceprint Recognition: The user's unique vocal characteristics are analyzed and compared to a
stored voiceprint for authentication.

Implementing these dynamic and biometric verification methods adds an extra layer of security by
requiring users to prove their authenticity through characteristics that are difficult for attackers to
replicate or manipulate. This multi-factor authentication approach enhances the overall security
posture of online platforms and helps protect users from unauthorized access and fraudulent
activities.

2.3 Related Work


Numerous survey papers have been published, providing insights into various solutions for
detecting phishing websites. Basit et al. conducted a survey on artificial intelligence-based
phishing detection techniques, analyzing the trends and harms of phishing attacks through
statistical reports [8]. The paper delves into major communication media and target devices used
in phishing attacks, detailing various attack techniques. It primarily focuses on anti-phishing
measures, categorizing them into four sections: machine learning, deep learning, hybrid learning,
and scenario-based. Each section presents multiple major algorithms and conducts thorough
comparisons among them. Conclusions drawn from state-of-the-art solutions highlight the
widespread and effective use of machine learning-based solutions, the significant contribution of
feature selection to high-grade performance, the increased computing resources required for high
accuracy, and the recognition of the random forest model for achieving the highest accuracy.

In a review by Singh et al. on machine learning-based phishing detection [9], the authors provide
a concise history of phishing and major attack reports. Phishing attacks are categorized into social
engineering attacks and malware-based phishing. Features are classified into three rule-based categories: source code features, URL features, and image features.

In 2020, Vijayalakshmi et al. presented a survey on major detection techniques and a taxonomy
for detecting phishing [10]. Utilizing a statistical report from APWG, the paper outlines the trends
of phishing attacks from 2017 to 2019. The survey introduces a taxonomy of automated phishing
detection solutions, classifying them into three categories based on input parameters: web address-based methods, webpage content-based solutions, and hybrid approaches. Web address-based
approaches are further divided into list-based, heuristic rule-based, and learning-based methods,
while web content-based solutions are separated into rule-based and machine learning-based
solutions. The authors list most state-of-the-art methodologies for each category, providing
detailed interpretations of each solution. Through a comparison of methods using various
evaluation metrics, including classification performance, limitations, third-party service
independence, and zero-hour attack detection, the authors suggest that hybrid approaches can
achieve high accuracy rates and are suitable for real-time systems. Additionally, they anticipate
that deep learning-based solutions will play a crucial role in the future.

Kalaharsha and Mehtre conducted a survey on phishing detection solutions, categorizing them
based on the applied techniques and input parameters [11]. The paper introduces various types of
phishing attacks and three phishing techniques. The authors outline 18 methods and 9 datasets for
detecting phishing websites, conducting a comparative analysis of accuracy performance across
all models. The paper also addresses challenges in the field, including the reduction of false-positive rates and mitigating overfitting.

In a more recent survey, Jain and Gupta provide a comprehensive analysis of phishing attack
techniques, detection methods, and existing challenges [12]. They incorporate statistical reports
and motivations behind phishing attacks, presenting diverse phishing attack techniques targeting
both PCs and smartphones. The authors introduce various defense methods and compare existing
anti-phishing approaches published between 2006 and 2017, highlighting their advantages and
limitations. The survey also outlines major challenges, such as the selection of efficient features,
identification of tiny URLs, and the detection of phishing attacks on smartphones.


3 Methodologies of Phishing Website Detection


Given that phishing is fundamentally a social engineering issue, effective countermeasures are
developed across various dimensions, encompassing education, legal supervision, and technical
approaches [4]. This survey specifically concentrates on technical strategies designed for the
detection of phishing websites. The methodologies for detecting phishing websites are categorized
into three primary groups: list-based, heuristic, and machine learning methods [13]. List-based
approaches comprise whitelists and blacklists, which are manually reported and authenticated by
systems. Whitelists consist of validated legitimate URLs or domains, while blacklists comprise
confirmed phishing websites. Once a user reports and verifies a website as phishing, the URL is
added to blacklists, serving to prevent other users from falling victim to such fraudulent sites.
Heuristic strategies employ a set of features extracted from the textual contents of a website to
identify phishing web pages. These features are then compared with those of legitimate websites.
The underlying idea is rooted in the observation that attackers typically deceive users by
mimicking well-known websites. Machine learning methods also leverage features from websites.
They involve building models that learn from structured data sets, enabling them to predict whether
a new website is a phishing website. In the realm of machine learning, detecting phishing websites
is approached as a classification problem, where the system classifies websites into phishing or
legitimate categories based on learned patterns and features.

3.1 List-Based Approaches


In 2016, Jain and Gupta introduced an auto-updated, whitelist-based approach for safeguarding
against phishing attacks on the client side. The experimental results showcased an impressive
86.02% accuracy, coupled with a minimal false-positive rate of less than 1.48%. The low false-
positive rate suggests a minimal occurrence of false warnings for potential phishing attacks.
Additionally, this approach offers the advantage of fast access times, ensuring real-time
responsiveness in various environments and products [14].

3.2 Heuristic Strategies


Tan et al. introduced a phishing detection approach named PhishWHO, comprising three distinct
phases. Firstly, it acquires identity keywords using a weighted URL token system and assembles
the N-gram model from the HTML of the webpage. Secondly, it utilizes these keywords in
mainstream search engines to locate the legitimate website and its corresponding legal domain.


Subsequently, a comparison is made between the legal domain and the domain of the target website
to determine its phishing status [15].

In a different approach, Chiew et al. utilized a website's logo image as a distinguishing feature to
verify its legitimacy [16]. The authors employed machine learning algorithms to extract logos from
webpage images and then conducted a domain query using the logo as a keyword in the Google
search engine. This category is often referred to as a search engine-based approach due to its
reliance on search engines for verification.

3.3 Machine Learning-Based Methods


To combat dynamic phishing attacks with enhanced accuracy and lower false-positive rates
compared to alternative methods, machine learning-based countermeasures are proposed [4]. The
machine learning approach comprises six essential components: data collection, feature extraction,
feature selection, model training, model testing, and prediction. Existing solutions for detecting phishing websites
based on machine learning adhere to this flowchart, optimizing one or more components to achieve
improved overall performance. This structured approach allows for the development of more
effective and efficient countermeasures against the evolving landscape of phishing attacks.

3.3.1 Data Collection and Feature Extraction


Data serves as the foundational element for each approach and plays a crucial role in influencing
performance. Two primary methods are employed for data collection: loading datasets already
published and directly pulling URLs from the Internet. Table 1 highlights several major data
sources. In the case of three published datasets, each row's data object contains various features
extracted from a URL along with a label indicating the class. Original URL strings can be gathered
from websites through open APIs or data mining scripts.

In a 2012 paper, Mohammad et al. introduced an automatic technique for feature extraction from
phishing websites, collecting 2,500 phishing URLs from the PhishTank archive [18]. They
extracted 17 features, categorized into address bar-based, abnormal-based, and HTML and
JavaScript-based features. Most features were automatically extracted from the URL and the web
page's source code without relying on third-party services. The age of the domain and the DNS record,
however, were obtained from the WHOIS database [19], and the web page rank was sourced from
the Alexa database [20]. The authors proposed an IF-ELSE rule and assigned weights to each
feature based on its contribution to phishing link classification.


In 2015, Mohammad et al. published a phishing website dataset on the UCI Machine Learning
Repository containing 11,055 instances with 30 features, which has become a foundational resource
for machine learning-based phishing detection solutions [22]. Choon also contributed a phishing
dataset on Mendeley in 2018, containing 10,000 data rows with 48 features, with phishing
webpages drawn from PhishTank and OpenPhish. Published datasets are generally smaller than
those used in other machine learning applications. Therefore, resampling techniques, such as N-fold
cross-validation, are often employed: the data is split into N pieces and iterated over N times, with each
iteration selecting one piece as testing data and the others as training data.
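The N-fold procedure described above can be sketched in plain Python (a minimal illustration; real pipelines would typically use a library routine such as scikit-learn's `KFold`):

```python
import random

def n_fold_splits(n_samples, n_folds, seed=42):
    """Shuffle sample indices and split them into n_folds pieces;
    each piece serves exactly once as the test set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

# 20 toy samples split 5-fold: every sample is tested exactly once.
for fold_no, (train, test) in enumerate(n_fold_splits(20, 5)):
    print(f"fold {fold_no}: {len(train)} training / {len(test)} testing samples")
```

Each iteration trains on four pieces and evaluates on the held-out fifth, so the reported performance averages over the whole (small) dataset.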

Alternatively, some researchers opt to collect URLs directly from the Internet, obtaining them
from sources like PhishTank, OpenPhish, and Spamhaus.org for phishing URLs, and
dmoztools.net, Alexa, and Common Crawl for legitimate websites. They then parse the features
themselves, contributing to a diverse array of data sources in the field.

With the successful development of natural language processing (NLP) techniques, many
researchers have adopted character-level features extracted from URL strings,
integrating them into deep learning models to enhance accuracy. This approach presents significant
advantages, including independence from cybersecurity expertise and the avoidance of reliance on
third-party network services [24].

Given that characters in URLs are continuous and lack semantic distinctions, character-level
features, such as character-level TF-IDF features, are utilized. TF-IDF, which stands for Term
Frequency–Inverse Document Frequency, operates at the character level, treating each character
as a term. The algorithm calculates the TF-IDF score for each character, generating a matrix
reflecting the relevance of each character in the URL string. Taking "https://www.google.com/"
(accessed on 18 July 2021) as an example, it comprises 23 characters ("h", "t", "t", "p", "s", ":",
"/", "/", "w", "w", "w", ".", "g", "o", "o", "g", "l", "e", ".", "c", "o", "m", "/"), forming a
character-level 23-gram in the corpus. Consequently, it generates a vector of 23 TF-IDF scores, one
per character position. The TF-IDF score for a single character is calculated as follows:

TF(t, d) = (Number of times character t appears in document d) / (Total number of characters in document d)

IDF(t, D) = log_e((Total number of documents D) / (Number of documents containing character t))

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)
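A minimal character-level TF-IDF implementation following these formulas might look as below; here each URL is a document, each character a term, and scores are computed per distinct character (the example URLs are illustrative):

```python
import math
from collections import Counter

def char_tfidf(urls):
    """Character-level TF-IDF: each character is a term, each URL a document.
    Returns one {character: score} mapping per URL."""
    N = len(urls)
    # Document frequency: number of URLs containing each character.
    df = Counter()
    for u in urls:
        df.update(set(u))
    vectors = []
    for u in urls:
        counts = Counter(u)
        total = len(u)
        vectors.append({c: (counts[c] / total) * math.log(N / df[c])
                        for c in counts})
    return vectors

vecs = char_tfidf(["https://www.google.com/",
                   "http://paypa1-login.example.com/",   # hypothetical phishing URL
                   "https://openphish.com/"])
```

Characters appearing in every URL (such as "/" or "t" here) receive an IDF of log(1) = 0, so only characters that discriminate between URLs contribute to the vector.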

Table 1. Major data sources for phishing website detection.

Data Source               Type               Remarks
UCI [22]                  Published dataset  11,055 instances with 30 features
Mendeley [23]             Published dataset  10,000 instances with 48 features
ISCX-URL-2016 [25]        Published dataset  Labeled benign and malicious URLs
https://phishtank.com     Website            Valid phishing URLs
https://openphish.com     Website            Valid phishing URLs
https://commoncrawl.org/  Website            Legitimate URLs
https://www.alexa.com/    Website            Legitimate URLs

3.3.2 Feature Selection
Feature selection is a crucial step in machine learning models as it involves automatically
identifying and selecting the most relevant features that contribute significantly to the model's
performance. This process enhances model efficiency, reduces training time (especially in deep
learning models), and mitigates overfitting issues. Generally, feature selection methodologies can
be categorized into three main types: filter method, wrapper method, and embedded method.

Zamir et al. employed various techniques, including recursive feature elimination (RFE),
information gain (IG), and relief ranking, to eliminate redundant features in phishing detection.
Additionally, they introduced principal components analysis (PCA) for attribute analysis. IG
serves as an indicator, calculating the importance of features based on class probability, feature
probability, and class probability under a feature condition. RFE is a widely used algorithm for
feature reduction, eliminating the least essential features until the error rate meets expectations.
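As a sketch of how information gain scores a feature (assuming discrete feature values, as in the UCI dataset), IG can be computed as the reduction in label entropy once a feature's value is known:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_col, labels):
    """IG = H(labels) - H(labels | feature): how much knowing the
    feature value reduces uncertainty about the class."""
    total = len(labels)
    cond = 0.0
    for v in set(feature_col):
        subset = [y for x, y in zip(feature_col, labels) if x == v]
        cond += (len(subset) / total) * entropy(subset)
    return entropy(labels) - cond

# Toy data: feature A perfectly predicts the label, feature B is noise.
labels = [1, 1, 0, 0]
feat_a = [1, 1, 0, 0]
feat_b = [1, 0, 1, 0]
print(information_gain(feat_a, labels))  # 1.0 (perfectly informative)
print(information_gain(feat_b, labels))  # 0.0 (uninformative)
```

Features are then ranked by their IG scores, and the lowest-scoring ones are discarded.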

The relief ranking filter, another technique used by Shabudin et al., calculates feature value scores
by comparing the values of two adjacent data points using the nearest neighbor search algorithm.
It then sorts these scores to obtain feature weights. After this process, they retained 22
features ranked by weight and removed eight redundant features with zero scores.

Zabihimayvan and Doran applied Fuzzy Rough Set (FRS) theory to select crucial features from
the UCI dataset and Mendeley dataset for phishing detection. FRS is an extension of Rough Set
(RS) theory, which determines a decision boundary by assessing the equality of data points based
on certain features and the same classes. FRS is particularly suitable for datasets where features
are discrete, as in the original UCI dataset. After normalization, where features are converted to
continuous values between 0 and 1, FRS is applied.

El-Rashidy proposed a novel technique for feature selection in web phishing detection models in
2021. This method consists of two phases. The first phase calculates the absence impact of each
feature by training a random forest model with a new dataset that removes one feature at a time.
The second phase involves training and testing the model, starting with one feature and
progressively adding features from a ranked list. The feature subset with the highest accuracy is
selected. This method is effective for selecting the most impactful feature subset, but its
computational complexity and time consumption make it more suitable for small feature sizes and
single classifiers.
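A simplified sketch of the two-phase idea follows (not the author's exact implementation: `score_fn` is a hypothetical stand-in for training and evaluating a classifier, such as a random forest, on a given feature subset):

```python
def rank_by_absence_impact(features, score_fn):
    """Phase 1 (sketch): rank features by the accuracy drop observed
    when each one is removed from the full feature set."""
    base = score_fn(set(features))
    impact = {f: base - score_fn(set(features) - {f}) for f in features}
    return sorted(features, key=lambda f: impact[f], reverse=True)

def best_incremental_subset(ranked, score_fn):
    """Phase 2 (sketch): grow the subset one ranked feature at a time
    and keep the prefix with the highest score."""
    best_subset, best_score = None, float("-inf")
    for i in range(1, len(ranked) + 1):
        s = score_fn(set(ranked[:i]))
        if s > best_score:
            best_subset, best_score = ranked[:i], s
    return best_subset, best_score

# Toy score function: feature "a" helps most, "c" actually hurts.
weights = {"a": 0.5, "b": 0.3, "c": -0.1}
score_fn = lambda subset: 0.2 + sum(weights.get(f, 0.0) for f in subset)

ranked = rank_by_absence_impact(["a", "b", "c"], score_fn)
subset, score = best_incremental_subset(ranked, score_fn)
```

Phase 1 costs one model fit per feature and Phase 2 one fit per subset size, which is why the method is best suited to small feature sets.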

3.3.3 Modeling
Machine learning models can be categorized into three types: single classifiers, hybrid models,
and deep learning approaches. Hybrid models integrate multiple algorithms during the training
process. Detecting phishing websites is typically framed as a binary classification problem. Here
are some commonly used classification algorithms:

Support Vector Machine (SVM): This supervised learning algorithm separates data points into
two classes and predicts the class of new data points. It is suitable for linear binary classification,
using a separating hyperplane in the N-dimensional space defined by the features. The primary idea
is to maximize the distance between the data points and the separating hyperplane. For instance, when
utilizing the UCI dataset for SVM training in a phishing vs. legitimate scenario, a 29-dimensional
hyperplane is employed.
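The margin-maximization idea can be illustrated with a minimal linear SVM trained by sub-gradient descent on the hinge loss (a toy sketch with two hypothetical URL features, not a production implementation):

```python
import random

def train_linear_svm(X, y, lr=0.01, lam=0.001, epochs=500, seed=0):
    """Minimal linear SVM via sub-gradient descent on the hinge loss.
    Labels are +1 (phishing) / -1 (legitimate)."""
    rnd = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rnd.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            if margin < 1:
                # Point inside the margin: move the hyperplane away from it.
                w = [wj - lr * (lam * wj - y[i] * xj) for wj, xj in zip(w, X[i])]
                b += lr * y[i]
            else:
                # Only the regularization (weight decay) term applies.
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy 2-feature data (hypothetical: URL length, dot count), linearly separable.
X = [[2, 1], [3, 2], [2, 2], [8, 7], [9, 8], [8, 8]]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
```

In practice an optimized solver (and often a kernel) would be used, but the training loop above captures the core objective: penalize points that fall inside the margin while keeping the weight vector small.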

Decision Tree: This popular algorithm has a tree structure in which each internal node represents a
feature, branches represent feature values, and leaf nodes provide the classification result. Simpler tree
structures generally yield better performance, but deep trees may lead to overfitting.

Random Forest: An ensemble of decision trees used for classification and regression. It addresses
overfitting by combining or averaging individual tree outputs during training, resulting in generally
higher accuracy compared to standalone decision trees.

k-Nearest Neighbors (k-NN): A non-parametric classification algorithm making predictions by
identifying similar data points through distance calculations. Methods like Euclidean distance for
continuous data and Hamming distance for discrete values are used. Notably, k-NN lacks a formal
training process, and each prediction may be time-consuming, making it less suitable for real-time
scenarios.

Bagging: Bagging, also known as bootstrap aggregating, is an ensemble meta-learning algorithm
designed to enhance the performance of other machine learning algorithms in classification and
regression tasks. The process involves dividing the original training dataset into N subsets using
bootstrapping techniques, resampling to generate datasets of the same size as the original, and then
executing classification in N iterations, which can be parallelized. The final step aggregates the
outputs of N classifiers through averaging or voting.
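The bootstrap-and-vote procedure can be sketched as follows (the 1-nearest-neighbour base learner and the single URL-length feature are illustrative choices, not requirements of bagging):

```python
import random
from collections import Counter

def bootstrap_sample(data, rnd):
    """Resample with replacement to a dataset of the same size as the original."""
    return [rnd.choice(data) for _ in data]

def nn_classifier(sample):
    """Toy base learner: 1-nearest neighbour on a single numeric feature."""
    def predict(x):
        return min(sample, key=lambda p: abs(p[0] - x))[1]
    return predict

def bagging_ensemble(data, n_estimators=15, seed=7):
    """Train one base learner per bootstrap sample; predict by majority vote."""
    rnd = random.Random(seed)
    models = [nn_classifier(bootstrap_sample(data, rnd))
              for _ in range(n_estimators)]
    def predict(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]
    return predict

# Toy data: (URL length, label) with 0 = legitimate, 1 = phishing.
data = [(12, 0), (15, 0), (18, 0), (60, 1), (75, 1), (90, 1)]
model = bagging_ensemble(data)
```

Because the N base learners are trained on independent bootstrap samples, their training (and here their predictions) can be parallelized, and the final vote smooths out the variance of any single learner.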

Naive Bayes: Naive Bayes is a probabilistic classification algorithm grounded in Bayes' theorem,
with a strong ("naive") assumption of independence between features. Bayes' theorem describes
conditional probability, and the naive Bayes classifier combines it with this simplifying assumption.

In recent years, researchers have increasingly employed hybrid classification approaches in
phishing website detection to achieve superior performance and reduced computational times
compared to single classifiers. Hybrid models typically build upon a primary learner, incorporating
additional algorithms for feature selection or optimizing initialization parameters, such as hyper-
parameters for neural networks.

The rapid evolution of deep learning and the success of natural language processing (NLP) have
led to the development of various deep learning models for phishing detection. These models
extract information and sequential patterns from URL strings without relying on source code
features from webpage content. They do not require specialized cybersecurity knowledge and do not
rely on third-party services to capture characteristics.

Some widely used deep learning algorithms include:

CNN (Convolutional Neural Network): CNN is a feedforward deep learning algorithm
extensively used in image classification. Its architecture consists of an input layer, hidden layers
(convolutional, pooling, and fully connected layers), and an output layer.
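To illustrate the core of a character-level CNN as applied to URLs, the sketch below one-hot encodes a URL and applies a single convolutional layer with global max pooling (the vocabulary is illustrative, and random kernel weights stand in for trained ones):

```python
import random

VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789:/.-_"

def one_hot(url, max_len=30):
    """Encode a URL as a max_len x |VOCAB| matrix of one-hot rows
    (truncated or zero-padded to max_len characters)."""
    mat = [[0.0] * len(VOCAB) for _ in range(max_len)]
    for i, ch in enumerate(url[:max_len]):
        j = VOCAB.find(ch)
        if j >= 0:
            mat[i][j] = 1.0
    return mat

def conv1d_maxpool(mat, kernels):
    """One convolutional layer (valid padding) followed by global max
    pooling, producing one activation per kernel."""
    k = len(kernels[0])                      # kernel width in characters
    feats = []
    for kern in kernels:
        acts = []
        for start in range(len(mat) - k + 1):
            window = mat[start:start + k]
            acts.append(sum(w * x
                            for wrow, xrow in zip(kern, window)
                            for w, x in zip(wrow, xrow)))
        feats.append(max(acts))              # global max pooling
    return feats

rnd = random.Random(0)
# Two random kernels of width 3; in a trained model these would be learned.
kernels = [[[rnd.uniform(-1, 1) for _ in VOCAB] for _ in range(3)]
           for _ in range(2)]
feats = conv1d_maxpool(one_hot("http://paypa1-login.example.com"), kernels)
```

In a full model the pooled features would feed fully connected layers ending in a phishing/legitimate output; frameworks such as PyTorch or TensorFlow would normally handle these layers.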


RNN (Recurrent Neural Network): RNN is a deep neural network equipped with an internal
memory function to handle sequences of varying lengths, making it particularly effective in text-
related tasks like text mining.

Table 2. Comparison of commonly used machine learning algorithms.

Algorithm                     Training Time Complexity               Interpretability  Training Data Size  Inputs
Support Vector Machine (SVM)  O(n^2)                                 Medium            Small               Structured data
k-Nearest Neighbors (k-NN)    O(knd), k = number of neighbors        Medium            Small               Structured data
Decision Tree                 O(nd log n)                            High              Small               Structured data
Random Forest                 O(knd log n), k = number of trees      Medium            Small               Structured data
Naive Bayes                   O(nd)                                  High              Small               Structured data
Deep Neural Networks          Compute the activation of all neurons  Low               Large               Structured or text data

Table 2 summarizes these algorithms based on a common dataset, with computational complexities
assessed using Big O notation. The interpretability of deep neural networks can be challenging due
to the difficulty in understanding the role of each neuron and the contribution of input features to
the model output. While traditional machine learning algorithms are user-friendly, deep neural
networks excel in handling text data, such as URL strings, but demand more extensive training
data for optimal performance.

3.3.4 Performance Evaluation


Performance is assessed during the testing phase, which involves the customary
division of the original dataset into training data (typically 80%) and test data (20%). When
scrutinizing the classifier's performance on the testing dataset, four key statistical metrics are
employed:

True Positives (TP): The count of positive data points correctly identified by the classifier.

True Negatives (TN): The count of negative data points correctly identified by the classifier.


False Positives (FP): The count of negative data points erroneously labeled as positive by the
classifier.

False Negatives (FN): The count of positive data points mistakenly labeled as negative by the
model.

These metrics, encapsulated in Table 3, provide a comprehensive insight into how well the
classifier performs in accurately identifying positive and negative instances within the testing
dataset.
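From these four counts, the usual evaluation metrics can be derived; a minimal sketch (the counts passed in are hypothetical):

```python
def evaluate(tp, tn, fp, fn):
    """Derive standard metrics from the four counts described above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called the true-positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical confusion counts from a 2,000-sample test split.
metrics = evaluate(tp=900, tn=950, fp=50, fn=100)
```

The false-positive rate highlighted by several surveyed works is the complementary quantity fp / (fp + tn), which measures how often legitimate sites trigger a phishing warning.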
