Survey of Machine Learning in Phishing Detection Research

This document provides an overview of machine learning techniques for detecting phishing websites. It begins with background on the prevalence of phishing attacks and importance of detecting phishing websites. It then discusses traditional list-based detection methods using blacklists and whitelists, noting their inability to identify new phishing sites quickly. The document proposes that rule-based and machine learning methods may help detect novel phishing websites faster. It intends to comprehensively survey the use of machine learning in phishing detection research.

Uploaded by

Hafizshoaib381


Page 1 of 21 - Cover Page Submission ID trn:oid:::1:2798103929

Turnitin LLC
Across the Spectrum (A Comprehensive Survey of Machine
Learning in Phishing Detection Research).docx
Course

CLASS-A0F7C44B4E06842E

DeVry University

Document Details

Submission ID: trn:oid:::1:2798103929
Submission Date: Jan 12, 2024, 7:55 AM CST
Download Date: Jan 12, 2024, 8:10 AM CST
File Name: Survey_of_Machine_Learning_in_Phishing_Detection_Research.docx
File Size: 142.0 KB
Pages: 19
Words: 5,893
Characters: 36,135

How much of this submission has been generated by AI?

45% of qualifying text in this submission has been determined to be generated by AI.

Caution: Percentage may not indicate academic misconduct. Review required. It is essential to
understand the limitations of AI detection before making decisions about a student's work. We
encourage you to learn more about Turnitin's AI detection capabilities before using the tool.


Across the Spectrum: A Comprehensive Survey of Machine Learning in Phishing Detection Research

1 Introduction
The Internet has become so integral to people's daily lives that a world without it is hard to
envision. As of January 2021, the global digital population report indicates
a staggering 4.66 billion active Internet users worldwide, constituting 59.5% of the global
population. Notably, 92.6% of these users access the Internet through smartphones [1]. This
widespread connectivity has revolutionized various facets of life, including information exchange,
online shopping, communication, and professional tasks. The onset of the COVID-19 pandemic
prompted a significant shift from traditional offline services to online platforms, particularly in
industries like catering and retail. In this digitally immersed era, individuals frequently share
sensitive data online, ranging from usernames and passwords to personal information and credit
card details. Unfortunately, cybercriminals exploit various illicit methods to acquire such
information, subsequently engaging in unauthorized activities on the Internet. Network security
concerns have been present since the early days of the Internet's inception, evolving in tandem
with its development. The rapid evolution of network attack techniques poses significant
challenges to cybersecurity. Noteworthy categories of cybersecurity issues, classified based on
attack methods and forms, include denial-of-service attacks (DoS), man-in-the-middle attacks
(MitM), SQL injection, zero-day exploits, DNS tunneling, phishing, and malware. The dynamic
landscape of the Internet and its vulnerabilities necessitate ongoing efforts to enhance
cybersecurity measures and protect users from potential threats.

Phishing stands as a sophisticated network attack that skillfully blends social engineering tactics
with computer technology to illicitly acquire sensitive personal information from unsuspecting
users. This method involves cyber attackers enticing individuals to click on fraudulent links
through deceptive emails, SMS messages, or social media communications. With a history
spanning over 30 years, phishing continues to be a pervasive threat, leading to substantial
economic losses as a considerable number of users fall victim each year. The year 2020 witnessed
a substantial surge in phishing attacks, exacerbating the prevalence of this cyber threat [2]. The
global COVID-19 pandemic prompted many countries to implement financial assistance programs
through government departments. Seizing this opportunity, cybercriminals utilized phishing
techniques to deceitfully gather sensitive personal information, enabling them to fraudulently
apply for government subsidies like unemployment benefits. Disturbingly, among the cyber-attack
complaints registered by the U.S. public in 2020, phishing-related complaints constituted the
highest proportion [2]. Further emphasizing the gravity of the situation, the APWG phishing
activity trends report for 2020 indicated a nearly twofold increase in the number of phishing attacks
throughout the year [3]. The pervasive and evolving nature of phishing highlights the need for
heightened awareness, robust cybersecurity measures, and ongoing efforts to protect individuals
and organizations from falling prey to these deceptive practices.

Anti-phishing strategies encompass a dual approach, involving both user education and technical
defense. This paper predominantly focuses on reviewing the technical defense methodologies that
have emerged in recent years, with a particular emphasis on the identification of phishing websites,
a pivotal step in stopping attackers from harvesting user information. Numerous academic research endeavors
and commercial products have been introduced to detect phishing websites. Traditional methods
often adopt list-based solutions, where valid and legitimate websites are compiled into whitelists,
and confirmed phishing sites are added to blacklists. These lists are then widely disseminated to
prevent other users from falling victim to attacks. This approach proves effective in preventing the
reuse of known phishing website URLs, thereby minimizing the number of affected users and
potential losses. Notably, these methods are commonly employed in real-time defensive actions
due to their low computational cost, since each lookup is a simple string match. However,
a significant drawback of list-based solutions lies in their inability to promptly detect new phishing
URLs. This limitation exposes innocent users to potential attacks before the malicious link is added
to a blacklist. In response, some researchers have proposed rule-based methods designed to
identify novel fraudulent websites. This methodology involves leveraging the expertise of security
professionals and conducting in-depth analyses of phishing sites. Adhering to the W3C standard
for URL components, which include the protocol, subdomain, domain name, port, path, query,
parameters, and fragment, rules are generated. For instance, rules may be established based on the
similarity of the domain name to other legitimate domains. Some rules may necessitate third-party
services to obtain additional information, such as the registration date of the domain. However, it
is crucial to acknowledge that the dissemination of rules through technical articles poses a risk.
Cybercriminals can study these rules and adapt, creating new phishing URLs that deviate from the
established patterns. Consequently, cybersecurity specialists have responded by developing more
sophisticated rules, some of which are derived from the source codes of web pages to enhance the
accuracy of phishing website detection. The ongoing evolution of technical defense strategies
remains essential in staying ahead of the dynamic landscape of phishing threats.
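
The list-based lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `WHITELIST` and `BLACKLIST` contents and the function names are hypothetical, and production systems draw on maintained feeds rather than hard-coded sets. The sketch also shows how a URL is split into the components that rules are written against.

```python
from urllib.parse import urlsplit

# Hypothetical example lists; real deployments pull these from maintained feeds.
WHITELIST = {"www.amazon.com"}
BLACKLIST = {"aimazon.amz-z7acyuup9z0y16.xyz"}


def url_components(url: str) -> dict:
    """Split a URL into the components that detection rules are written against."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname or "",  # subdomain + domain name, lowercased
        "port": parts.port,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


def list_based_verdict(url: str) -> str:
    """Return 'legitimate', 'phishing', or 'unknown' from exact host lookups."""
    host = url_components(url)["host"]
    if host in WHITELIST:
        return "legitimate"
    if host in BLACKLIST:
        return "phishing"
    # A brand-new phishing URL lands here until someone reports it.
    return "unknown"
```

The `unknown` branch is the weakness discussed in the text: a set lookup is fast enough for real-time use, but it says nothing about URLs that no list has seen yet.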

With the continuous evolution of machine learning techniques, a plethora of methodologies rooted
in machine learning has surfaced for the purpose of recognizing phishing websites, aiming to
enhance prediction performance. Phishing detection, as a supervised classification approach,
leverages labeled datasets to train models that can effectively classify data. Various algorithms are
employed in this supervised learning process, including but not limited to naïve Bayes, neural
networks, linear regression, logistic regression, decision trees, support vector machines, K-nearest
neighbor, and random forests. Practical products require robust solutions that meet two key
requirements. First, they need high accuracy coupled with a low false warning rate; achieving this
often hinges on the availability of substantial datasets, particularly for complex structures like
neural networks. Second, computational time must be low enough for real-time detection systems
to be viable.
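
To make the first requirement concrete, the following Python sketch computes accuracy and the false warning rate from a set of predictions. It assumes that "false warning rate" means the false-positive rate, i.e., the share of legitimate sites wrongly flagged as phishing; the function name and the label encoding (1 = phishing, 0 = legitimate) are illustrative choices, not taken from the surveyed papers.

```python
def accuracy_and_false_warning_rate(y_true, y_pred):
    """Return (accuracy, false-positive rate); label 1 = phishing, 0 = legitimate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false warning
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    false_warning_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, false_warning_rate
```

Reporting both numbers matters because a model can reach high accuracy on an imbalanced dataset while still interrupting users with too many warnings on legitimate sites.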

This paper primarily aims to survey effective methods for preventing phishing attacks in real-time
environments. It delineates the basic life cycle of a phishing attack, starting from the entry point,
with a specific focus on the phase when a user clicks on a phishing link. The paper reviews
technical methods that identify the phishing link and promptly alert the user. In addition to
conventional approaches such as blacklist matching and recognition methods, this paper provides
an in-depth exploration of machine learning-based URL detection technology. By presenting state-
of-the-art solutions, the paper engages in a comparative analysis, highlighting the challenges and
limitations inherent in each solution. Furthermore, it offers insights into potential research
directions and future solutions. The main contributions of this paper lie in its comprehensive
review of effective anti-phishing methods, its analytical comparison of various solutions, and its
provision of valuable ideas for future research in this domain. Specifically, it provides:

1. A phishing life cycle to clearly capture the phishing problem;

2. A survey of major datasets and data sources for phishing website detection;

3. A state-of-the-art survey of machine learning-based solutions for detecting phishing websites.

This paper is structured as follows. In Section 2, we delve into the background and related work
concerning phishing. Section 3 provides an exhaustive exploration of methodologies for detecting
website phishing, categorizing them into list-based methods, heuristic strategies, and machine
learning-based solutions. Specifically, we elucidate the intricate details of the general architecture
underlying phishing network detection solutions based on machine learning. Section 4 introduces
various frameworks employed in website phishing detection systems. Moving forward, Section 5
outlines cutting-edge machine learning-based solutions, organized into three distinct categories
based on the number and characteristics of the learning model. Section 6 is dedicated to a thorough
discussion of the challenges associated with detecting phishing attacks. Finally, in Section 7, we
draw conclusions from the insights gathered in this study. Each section contributes to the
overarching goal of presenting a comprehensive overview of effective methods, frameworks, and
state-of-the-art solutions in the realm of phishing detection, while also addressing the complexities
and obstacles inherent in this evolving field.

2 Literature Review
Phishing stands as a prevalent cyberattack tactic wherein perpetrators send deceptive emails or
messages with the aim of tricking recipients into visiting fraudulent pages. The ultimate goal is to
illicitly gather sensitive user data, including usernames, passwords, and credit card numbers, for
financial exploitation. Figure 1 illustrates the typical phishing life cycle. In the initial stage, an
attacker creates a phishing website designed to closely resemble a legitimate counterpart. The
manipulation of the URL is a common strategy employed by attackers, involving tactics like
spelling errors, the use of similar alphabetic characters, and other methods to mimic the URL of
the authentic website. Notably, attackers often focus on imitating the domain name and network
resource directory. For instance, a deceptive link like
"https://aimazon.amz-z7acyuup9z0y16.xyz/v" (accessed on 9 January 2024) attempts to emulate the
legitimate https://www.amazon.com. Although a desktop browser reveals the actual URL when the user
hovers over the link, such distinctions can be challenging for the average user to discern visually or from
memory, making them susceptible to these imitative tactics. Beyond manipulating URLs, attackers
also emphasize the imitation of web content. This involves the use of scripts to extract logos, web
layouts, and text directly from authentic web pages. Cybercriminals commonly replicate form
submission pages that require users to input sensitive information. Examples include forged
login pages, payment pages, and password recovery pages, all designed to dupe users into
divulging confidential data. This multifaceted approach underscores the sophistication and
deceptive nature of phishing attacks, exploiting both technical and psychological vulnerabilities of
unsuspecting users.
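
The URL imitation described above (misspellings and look-alike characters) can be flagged with a simple edit-distance check. The Python sketch below is a minimal illustration, not a method from the surveyed literature: `KNOWN_BRANDS` is a hypothetical watch list, and the threshold of one edit is an arbitrary assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


# Hypothetical watch list of brand names that attackers commonly imitate.
KNOWN_BRANDS = ("amazon", "paypal", "google")


def lookalike_brand(host: str, max_distance: int = 1):
    """Return the brand that some host label nearly matches, or None."""
    for label in host.lower().split("."):
        for brand in KNOWN_BRANDS:
            # Distance 0 is the genuine name; small non-zero distances are suspicious.
            if 0 < edit_distance(label, brand) <= max_distance:
                return brand
    return None
```

With this sketch, the host of the deceptive link above is flagged because its first label, `aimazon`, is one insertion away from `amazon`, while `www.amazon.com` is not flagged. A check like this only catches near-misses; it says nothing about who actually owns a domain.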

Figure 1: Phishing life cycle.

The second phase of the phishing life cycle involves sending emails that strongly encourage
recipients to click on the provided link. Phishing links are not limited to emails but can also be
disseminated through SMS, voice messages, QR codes, and even spoofed mobile applications [4].
With the prevalent use of smartphones and social media, criminals have expanded their channels
for spreading false information. In these methods, text and images are commonly employed to
deceive individuals into clicking on the deceptive link. For instance, an attacker might impersonate
the customer service of a telecommunications company, sending emails that urge users to make
payments to prevent service disruptions. While scam emails are typically sent indiscriminately,
there is always a subset of users with lower defensive awareness who may fall victim to such
tactics. In this step, attackers leverage social engineering methods, including psychological
manipulation, to induce users into making security mistakes. Perpetrators are adept at instilling a
sense of fear and urgency while gaining the user's trust through text messages. Subsequently, users
click on the link, which redirects them to a fraudulent website. Notably, on mobile phones the real
URL string is often hidden until the browser has already opened the redirected page.

The subsequent step involves the collection of personal information on the fake website, designed
to closely resemble the legitimate web page of a company or organization. This mimicry includes
using a similar logo, name, user interface design, and content. This deceptive tactic is frequently
employed in processes like login, password reset, payment, and renewal of personal information.
When users submit sensitive information on web servers established by attackers, criminals gain
access to all the data provided.

The final step entails stealing the user's account funds by utilizing the user's genuine information
to fabricate requests on a legitimate website. Some individuals use the same usernames and
passwords across multiple websites, allowing attackers to compromise multiple accounts
belonging to the user. In some instances, phishers leverage stolen data for further criminal
activities. Since the inception of phishing techniques documented in a paper in 1987 [5], these
methods have evolved with the development of the Internet. For instance, with the rise of online
payment systems, attackers have shifted their focus to online payment phishing. According to the
2020 Internet Crime Report, the Internet Crime Complaint Center (IC3) received 791,790
cyberattack complaints, with phishing scams accounting for around 30% of these, making them
the most reported type of cybercrime and causing over USD 54 million in losses [2]. Consequently,
for individuals navigating the Internet, the ability to distinguish between genuine and fraudulent
web pages is imperative. Visual tools are essential to aid users in identifying phishing websites.

2.1 Anti-Phishing
As depicted in Figure 1, the phishing life cycle involves five distinct steps leading up to an
attacker stealing money from the user's account or utilizing the gathered information for other
malicious activities. Because each step depends on the one before it, thwarting any one of them can
effectively halt a phishing attack. Here, we examine anti-phishing measures stage by stage along
the phishing life cycle. By addressing vulnerabilities and implementing countermeasures at each
phase, organizations and individuals can enhance their resilience against these deceptive cyber
threats.

2.2 Web Scraping

While it may be challenging to entirely prevent perpetrators from creating deceptive web pages,
certain techniques can significantly increase the cost and complexity for attackers. One common
tactic employed by attackers involves scripted crawlers that automatically extract content from
legitimate web pages. The attackers then copy the valuable parts of that content onto phishing
web pages. In response, legitimate websites can employ
various techniques to hinder web scraping, thereby impeding the automated extraction of data by
attackers. One effective method involves obfuscation, which includes the use of CSS sprites to
display critical data. CSS sprites involve combining multiple images into a single image file and
then using CSS to display specific portions of that image as needed. This technique makes it more
challenging for automated scripts to decipher and extract relevant information, as the data is not
presented in a straightforward manner. Another strategy is to replace text with images. By
converting textual content into image files, legitimate websites introduce an additional layer of
complexity for scraping tools. Automated scripts find it harder to interpret and extract meaningful
data from images compared to plain text. This approach adds an extra barrier for attackers seeking
to collect information from web pages. By implementing these obfuscation techniques, legitimate
websites can introduce hurdles and increase the effort required for attackers to scrape content.
While not foolproof, these measures contribute to raising the costs and challenges associated with
creating fraudulent web pages through automated means.

2.2.1 Spam Filter

Spam filtering techniques play a crucial role in identifying and mitigating unsolicited emails before
users have the chance to read or click on any embedded links. Many popular email services,
including Gmail, Yahoo, Outlook, and AOL, have incorporated spam filtering components as an
integral part of their platforms. These filters aim to distinguish between legitimate emails and
spam, enhancing user experience and security. Earlier spam filters relied on relatively
straightforward methods, such as blacklists or whitelists and empirical rules. However, with the
advancements in artificial intelligence technology, modern spam filters have evolved to integrate
intelligent prediction models based on machine learning. This incorporation of machine learning
allows filters to analyze and identify patterns associated with spam, even if specific instances are
not explicitly listed in predefined blacklists. For instance, Gmail utilizes a machine learning-based
spam filter that can effectively block around 100 million additional spam emails on a daily basis.
This dynamic approach enables the filter to adapt and improve its effectiveness over time, learning
from patterns and characteristics associated with spam emails. The integration of machine learning
enhances the ability of spam filters to detect and prevent new and evolving spam tactics, providing
users with a more robust defense against unwanted and potentially harmful email content.
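
As a rough illustration of how a learned filter differs from a fixed blacklist, the Python sketch below implements a tiny multinomial naive Bayes classifier over word counts with add-one smoothing. It is a teaching sketch only: the class and method names are hypothetical, it is not how Gmail or any other provider implements filtering, and it assumes at least one training message per class.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Multinomial naive Bayes over word counts with add-one smoothing.

    Assumes train() has been called at least once for both 'spam' and 'ham'.
    """

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text: str, label: str) -> None:
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def _log_score(self, words, label: str) -> float:
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)  # class prior
        for word in words:
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        return score

    def is_spam(self, text: str) -> bool:
        words = text.lower().split()
        return self._log_score(words, "spam") > self._log_score(words, "ham")
```

Because the decision comes from word statistics rather than an exact match, a message can be classified as spam even when its sender and links appear on no list, which is the adaptive behavior the paragraph above describes.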

2.2.2 Detecting Fake Websites

When users unknowingly land on a phishing web page designed to mimic a legitimate website, it
can be challenging for many people to recall the authentic domain name, particularly in the case
of lesser-known start-ups or companies. As a result, users may struggle to identify phishing
websites solely based on the URL. To address this issue, some web browsers have integrated
security components designed to detect phishing or malware sites, providing users with warnings
when attempting to access potentially unsafe web pages. For example, Google Chrome
incorporates a security feature that displays warning messages when users visit websites deemed
unsafe. Google's initiative, Google Safe Browsing, launched in 2007, is an integral part of various
Google products, including Gmail and Google Search. This security component relies on a
blacklist containing URLs associated with malware or phishing attempts [7]. While web browsers
and extensions may offer protection based on blacklists or whitelists, these solutions are less
effective when dealing with previously unknown phishing websites. Fortunately, the rapid
advancements in artificial intelligence technology have introduced new ideas and solutions for
detecting phishing attacks. Predictive models based on machine learning have demonstrated the
capability to identify phishing links that do not yet appear on any blacklist, including links
adapted to circumvent existing rules. This enhances the ability to detect and combat phishing
attacks in real time, offering a more dynamic and effective defense against evolving cyber threats.

2.2.3 Second Authorization Verification

Once an attacker successfully acquires the user's sensitive data, the subsequent step typically
involves utilizing this information to gain unauthorized access to the legitimate website,
manipulate the user's account, and carry out fraudulent activities, such as stealing funds. In
response to this threat, it becomes imperative for websites to implement additional steps to verify
the authenticity of the user, particularly when there are disparities in the IP address and device
information used during the login attempt compared to the user's typical usage patterns. To
enhance security, many websites incorporate extra verification measures that go beyond traditional
username and password authentication. These additional verifications often leverage dynamic and
biometric factors, which are more challenging for attackers to manipulate. Examples of such
measures include:

Facial Movement Recognition: This involves analyzing facial features and movements in real-
time to authenticate the user.

Expression Recognition: This method assesses the user's facial expressions during the login
process to verify their identity.

Voiceprint Recognition: The user's unique vocal characteristics are analyzed and compared to a
stored voiceprint for authentication.

Implementing these dynamic and biometric verification methods adds an extra layer of security by
requiring users to prove their authenticity through characteristics that are difficult for attackers to
replicate or manipulate. This multi-factor authentication approach enhances the overall security
posture of online platforms and helps protect users from unauthorized access and fraudulent
activities.
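
The trigger for a second verification step can be sketched as a comparison between the current login attempt and the user's usual IP addresses and devices. The Python sketch below is a minimal illustration; the `UserProfile` record and the function name are hypothetical, and real systems typically weigh many more signals than these two.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Hypothetical record of where a user normally logs in from."""
    known_ips: set = field(default_factory=set)
    known_devices: set = field(default_factory=set)


def needs_second_verification(profile: UserProfile, ip: str, device_id: str) -> bool:
    """Require an extra factor when the IP address or the device is unfamiliar."""
    return ip not in profile.known_ips or device_id not in profile.known_devices
```

For example, a login from a recorded address and device passes with the password alone, while stolen credentials used from an attacker's machine would prompt one of the verification measures listed above.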

2.3 Related Work

Numerous survey papers have been published, providing insights into various solutions for
detecting phishing websites. Basit et al. conducted a survey on artificial intelligence-based
phishing detection techniques, analyzing the trends and harms of phishing attacks through
statistical reports [8]. The paper delves into major communication media and target devices used
in phishing attacks, detailing various attack techniques. It primarily focuses on anti-phishing
measures, categorizing them into four sections: machine learning, deep learning, hybrid learning,
and scenario-based. Each section presents multiple major algorithms and conducts thorough
comparisons among them. Conclusions drawn from state-of-the-art solutions highlight the
widespread and effective use of machine learning-based solutions, the significant contribution of
feature selection to high-grade performance, the increased computing resources required for high
accuracy, and the recognition of the random forest model for achieving the highest accuracy.

In a review by Singh et al. on machine learning-based phishing detection [9], the authors provide
a concise history of phishing and major attack reports. Phishing attacks are categorized into social
engineering attacks and malware-based phishing. Features are classified into three categories—
source code features, URL features, and image features—all based on rules.

In 2020, Vijayalakshmi et al. presented a survey on major detection techniques and a taxonomy
for detecting phishing [10]. Utilizing a statistical report from APWG, the paper outlines the trends
of phishing attacks from 2017 to 2019. The survey introduces a taxonomy of automated phishing
detection solutions, classifying them into three categories based on input parameters: web address-
based methods, webpage content-based solutions, and hybrid approaches. Web address-based
approaches are further divided into list-based, heuristic rule-based, and learning-based methods,
while web content-based solutions are separated into rule-based and machine learning-based
solutions. The authors list most state-of-the-art methodologies for each category, providing
detailed interpretations of each solution. Through a comparison of methods using various
evaluation metrics, including classification performance, limitations, third-party service
independence, and zero-hour attack detection, the authors suggest that hybrid approaches can
achieve high accuracy rates and are suitable for real-time systems. Additionally, they anticipate
that deep learning-based solutions will play a crucial role in the future.

Kalaharsha and Mehtre conducted a survey on phishing detection solutions, categorizing them
based on the applied techniques and input parameters [11]. The paper introduces various types of
phishing attacks and three phishing techniques. The authors outline 18 methods and 9 datasets for
detecting phishing websites, conducting a comparative analysis of accuracy performance across
all models. The paper also addresses challenges in the field, including the reduction of false-
positive rates and mitigating overfitting.

In a more recent survey, Jain and Gupta provide a comprehensive analysis of phishing attack
techniques, detection methods, and existing challenges [12]. They incorporate statistical reports
and motivations behind phishing attacks, presenting diverse phishing attack techniques targeting
both PCs and smartphones. The authors introduce various defense methods and compare existing
anti-phishing approaches published between 2006 and 2017, highlighting their advantages and
limitations. The survey also outlines major challenges, such as the selection of efficient features,
identification of tiny URLs, and the detection of phishing attacks on smartphones.

3 Methodologies of Phishing Website Detection

Given that phishing is fundamentally a social engineering issue, effective countermeasures are
developed across various dimensions, encompassing education, legal supervision, and technical
approaches [4]. This survey specifically concentrates on technical strategies designed for the
detection of phishing websites. The methodologies for detecting phishing websites are categorized
into three primary groups: list-based, heuristic, and machine learning methods [13]. List-based
approaches comprise whitelists and blacklists, which are manually reported and authenticated by
systems. Whitelists consist of validated legitimate URLs or domains, while blacklists comprise
confirmed phishing websites. Once a user reports and verifies a website as phishing, the URL is
added to blacklists, serving to prevent other users from falling victim to such fraudulent sites.
Heuristic strategies employ a set of features extracted from the textual contents of a website to
identify phishing web pages. These features are then compared with those of legitimate websites.
The underlying idea is rooted in the observation that attackers typically deceive users by
mimicking well-known websites. Machine learning methods also leverage features from websites.
They involve building models that learn from structured data sets, enabling them to predict whether
a new website is a phishing website. In the realm of machine learning, detecting phishing websites
is approached as a classification problem, where the system classifies websites into phishing or
legitimate categories based on learned patterns and features.

3.1 List-Based Approaches

In 2016, Jain and Gupta introduced an auto-updated, whitelist-based approach for safeguarding
against phishing attacks on the client side. The experimental results showed 86.02% accuracy
with a false-positive rate below 1.48%. The low false-positive rate means that users receive few
false warnings about potential phishing attacks. Additionally, this approach offers fast access
times, ensuring real-time responsiveness in various environments and products [14].
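A list-based check reduces to a set lookup on the URL's host, which is why access is fast. The sketch below only illustrates that idea and is not the method of [14]; the list contents and the function name `check_url` are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical list contents, for illustration only.
WHITELIST = {"example.com", "www.example.com"}
BLACKLIST = {"phish.example.net"}


def check_url(url: str) -> str:
    """Classify a URL as 'legitimate', 'phishing', or 'unknown' by list lookup."""
    host = (urlsplit(url).hostname or "").lower()
    if host in BLACKLIST:
        return "phishing"
    if host in WHITELIST:
        return "legitimate"
    # On neither list: a purely list-based method cannot decide.
    return "unknown"
```

The `unknown` branch is the weak point of list-based methods: a newly created phishing site is on neither list until someone reports it.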

3.2 Heuristic Strategies

Tan et al. introduced a phishing detection approach named PhishWHO, comprising three distinct
phases. First, it acquires identity keywords using a weighted URL token system and assembles
the N-gram model from the HTML of the webpage. Second, it submits these keywords to
mainstream search engines to locate the legitimate website and its corresponding domain.
Third, it compares the legitimate domain with the domain of the target website
to determine its phishing status [15].
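The final comparison step can be sketched as a host match against the domain returned by the search. This is a simplified illustration, not PhishWHO's actual matching logic; the function name `same_domain` is hypothetical, and the search-engine lookup is assumed to have already produced the legitimate domain.

```python
from urllib.parse import urlsplit


def same_domain(target_url: str, legitimate_domain: str) -> bool:
    """True if the target URL's host is the legitimate domain or one of its subdomains."""
    host = (urlsplit(target_url).hostname or "").lower()
    legit = legitimate_domain.lower()
    # Require a dot boundary so that 'example.com.evil.test' does not match 'example.com'.
    return host == legit or host.endswith("." + legit)
```

A target whose host does not match the domain found by the search would be flagged as suspicious under this scheme.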

In a different approach, Chiew et al. utilized a website's logo image as a distinguishing feature to
verify its legitimacy [16]. The authors employed machine learning algorithms to extract logos from
webpage images and then conducted a domain query using the logo as a keyword in the Google
search engine. This category is often referred to as a search engine-based approach due to its
reliance on search engines for verification.

3.3 Machine Learning-Based Methods

To combat dynamic phishing attacks with higher accuracy and lower false-positive rates
than alternative methods, machine learning-based countermeasures have been proposed [4]. The
machine learning approach comprises the following components: data collection, feature extraction,
model training, model testing, and prediction. Existing machine learning solutions for detecting
phishing websites follow this workflow, optimizing one or more components to achieve
improved overall performance. This structured approach allows for the development of more
effective and efficient countermeasures against the evolving landscape of phishing attacks.

3.3.1 Data Collection and Feature Extraction

Data serves as the foundational element for each approach and plays a crucial role in influencing
performance. Two primary methods are employed for data collection: loading datasets already
published and directly pulling URLs from the Internet. Table 1 highlights several major data
sources. In the case of three published datasets, each row's data object contains various features
extracted from a URL along with a label indicating the class. Original URL strings can be gathered
from websites through open APIs or data mining scripts.

In a 2012 paper, Mohammad et al. introduced an automatic technique for feature extraction from
phishing websites, collecting 2500 phishing URLs from the PhishTank archive [18]. They
extracted 17 features, categorized into address bar-based, abnormal-based, and HTML and
JavaScript-based features. Most features were automatically extracted from the URL and the web
page's source code without relying on third-party services. The age of the domain and the DNS
record, however, were obtained from the WHOIS database [19], and the web page rank was sourced
from the Alexa database [20]. The authors proposed an IF-ELSE rule and assigned weights to each
feature based on its contribution to phishing link classification.

In 2015, Mohammad et al. published a phishing website dataset on the UCI Machine Learning
Repository containing 11,055 instances with 30 features; it has become a foundational resource
for machine learning-based phishing detection solutions [22]. Choon also contributed a phishing
dataset on Mendeley in 2018, containing 10,000 data rows with 48 features extracted from
PhishTank and OpenPhish for phishing webpages [23]. These published datasets are small compared
with those used in other machine learning applications. Therefore, resampling techniques such as
N-fold cross-validation are often employed: the data is split into N pieces and the procedure is
iterated N times, with each iteration selecting one piece as testing data and the others as
training data.
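The N-fold procedure can be sketched in a few lines. This is a minimal illustration with a hypothetical function name, `n_fold_splits`; it assigns rows to folds by position, whereas in practice the rows would usually be shuffled first.

```python
def n_fold_splits(num_rows: int, n_folds: int):
    """Yield (train_indices, test_indices) once for each of the n_folds iterations."""
    indices = list(range(num_rows))
    # Fold i takes every n_folds-th row starting at i, so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # All rows outside the current fold form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Every row is used for testing exactly once, so the averaged score draws on the whole dataset even when it is small.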

Alternatively, some researchers opt to collect URLs directly from the Internet, obtaining
phishing URLs from sources like PhishTank, OpenPhish, and Spamhaus.org, and legitimate URLs
from dmoztools.net, Alexa, and Common Crawl. They then parse the features
themselves, contributing to a diverse array of data sources in the field.

With the successful development of natural language processing (NLP) techniques, many
researchers now capture character-level features from URL strings using NLP methods and
feed them into deep learning models to enhance accuracy. This approach presents significant
advantages, including independence from cybersecurity expertise and the avoidance of reliance on
third-party network services [24].

Given that URLs are continuous character strings without word boundaries, character-level
features, such as character-level TF-IDF features, are utilized. TF-IDF, which stands for Term
Frequency–Inverse Document Frequency, is applied here at the character level, treating each
character as a term. The algorithm calculates the TF-IDF score for each character, generating a
matrix reflecting the relevance of each character in each URL string. Taking
"https://www.google.com/" (accessed on 18 July 2021) as an example, the string comprises 23
characters ("h", "t", "t", "p", "s", ":", "/", "/", "w", "w", "w", ".", "g", "o", "o", "g", "l", "e", ".",
"c", "o", "m", "/"), 14 of which are distinct. Each distinct character is a term, so only these 14
characters can receive a nonzero score in the URL's TF-IDF vector. The TF-IDF score for a single
character is calculated using the mathematical formulation shown below:

\[
TF(t, d) = \frac{\text{number of times character } t \text{ appears in document } d}{\text{total number of characters in document } d}
\]

\[
IDF(t, D) = \log_e\left(\frac{\text{total number of documents in the corpus } D}{\text{number of documents containing character } t}\right)
\]

\[
TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D)
\]
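The three formulas above can be computed directly with the standard library. The sketch below is illustrative only: the function names and the three-URL corpus are hypothetical, and `char_tfidf` assumes the scored URL is itself part of the corpus so that every character has an IDF value.

```python
import math
from collections import Counter


def char_tf(url: str) -> dict:
    """TF(t, d): count of character t divided by the length of the URL d."""
    counts = Counter(url)
    return {ch: n / len(url) for ch, n in counts.items()}


def char_idf(corpus: list) -> dict:
    """IDF(t, D): natural log of |D| over the number of URLs containing t."""
    doc_freq = Counter(ch for url in corpus for ch in set(url))
    return {ch: math.log(len(corpus) / df) for ch, df in doc_freq.items()}


def char_tfidf(url: str, corpus: list) -> dict:
    """TFIDF(t, d, D) = TF(t, d) * IDF(t, D) for every distinct character in the URL."""
    idf = char_idf(corpus)
    return {ch: tf * idf[ch] for ch, tf in char_tf(url).items()}


# Hypothetical three-URL corpus, for illustration only.
corpus = ["https://www.google.com/", "http://example.org/", "https://a.b/"]
scores = char_tfidf(corpus[0], corpus)
```

Characters that occur in every URL of the corpus, such as "h" here, get an IDF of zero and therefore a TF-IDF score of zero, while characters that are rare across the corpus are weighted up.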

Data Source               Type               Remarks
UCI [22]                  Published dataset  11,055 instances with 30 features
Mendeley [23]             Published dataset  10,000 instances with 48 features
ISCX-URL-2016 [25]        Published dataset  11,055 instances with 30 features
https://phishtank.com     Website            Valid phishing URLs
https://openphish.com     Website            Valid phishing URLs
https://commoncrawl.org/  Website            Legitimate URLs
https://www.alexa.com/    Website            Legitimate URLs

3.3.2 Feature Selection
Feature selection is a crucial step in machine learning models as it involves automatically
identifying and selecting the most relevant features that contribute significantly to the model's
performance. This process enhances model efficiency, reduces training time (especially in deep
learning models), and mitigates overfitting issues. Generally, feature selection methodologies can
be categorized into three main types: filter method, wrapper method, and embedded method.

Zamir et al. employed various techniques, including recursive feature elimination (RFE),
information gain (IG), and relief ranking, to eliminate redundant features in phishing detection.
Additionally, they introduced principal components analysis (PCA) for attribute analysis. IG
serves as an indicator, calculating the importance of features based on class probability, feature
probability, and class probability under a feature condition. RFE is a widely used algorithm for
feature reduction, eliminating the least essential features until the error rate meets expectations.
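Information gain for a discrete feature can be computed from exactly the three quantities named above: the class probabilities, the feature-value probabilities, and the class probabilities conditioned on each feature value. The following is a minimal sketch with hypothetical function names, not the implementation used by Zamir et al.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels) -> float:
    """Entropy of the labels minus their expected entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weight each group's entropy by the probability of its feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A feature that separates the classes perfectly removes all uncertainty about the label, while a feature that is independent of the class has a gain of zero and is a candidate for removal.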

The relief ranking filter, another technique used by Shabudin et al., calculates feature value scores
by comparing the values of two adjacent data points using the nearest neighbor search algorithm.
It then sorts these scores to obtain feature value weights. After this process, they obtained 22
features ranked by weight and removed eight redundant features with zero scores.

Zabihimayvan and Doran applied Fuzzy Rough Set (FRS) theory to select crucial features from
the UCI dataset and Mendeley dataset for phishing detection. FRS is an extension of Rough Set
(RS) theory, which determines a decision boundary by assessing the equality of data points based
on certain features and the same classes. RS theory is suited to discrete features, whereas FRS
also handles continuous ones. The features in the original UCI dataset are discrete; after
normalization, which converts them to continuous values between 0 and 1, FRS is applied.

El-Rashidy proposed a novel technique for feature selection in web phishing detection models in
2021. This method consists of two phases. The first phase calculates the absence impact of each
feature by training a random forest model with a new dataset that removes one feature at a time.
The second phase involves training and testing the model, starting with one feature and
progressively adding features from a ranked list. The feature subset with the highest accuracy is
selected. This method is effective for selecting the most impactful feature subset, but its
computational complexity and time consumption make it more suitable for small feature sizes and
single classifiers.

3.3.3 Modeling
Machine learning models can be categorized into three types: single classifiers, hybrid models,
and deep learning approaches. Hybrid models integrate multiple algorithms during the training
process. Detecting phishing websites is typically framed as a binary classification problem. Here
are some commonly used classification algorithms:

Support Vector Machine (SVM): This supervised learning algorithm separates data points into
two classes and assigns each new data point to one of them. It is suitable for linear binary
classification, using a hyperplane in the N-dimensional feature space, where N is the number of
features. The primary idea is to maximize the margin, that is, the distance between the
separating hyperplane and the nearest data points of each class. For instance, when the UCI
dataset with its 30 features is used for SVM training in a phishing vs. legitimate scenario, the
separating hyperplane is 29-dimensional.

Decision Tree: This popular algorithm has a tree structure in which each internal node represents
a feature, branches represent feature values and possibilities, and each leaf node provides the
result. Simpler tree structures generally yield better performance, whereas deep trees may lead to
overfitting.

Random Forest: An ensemble of decision trees used for classification and regression. It addresses
overfitting by combining or averaging individual tree outputs during training, resulting in generally
higher accuracy compared to standalone decision trees.

k-Nearest Neighbors (k-NN): A non-parametric classification algorithm that makes predictions by
identifying similar data points through distance calculations, using measures such as Euclidean
distance for continuous data and Hamming distance for discrete values. Notably, k-NN lacks a formal
training process, and each prediction may be time-consuming, making it less suitable for real-time
scenarios.
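A minimal k-NN sketch with Hamming distance shows why there is no training step and why prediction is slow: each query is compared against every stored row. The function names and the encoding of features as small integer tuples are assumptions made for illustration.

```python
from collections import Counter


def hamming(a, b) -> int:
    """Number of positions at which two equal-length feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k: int = 3):
    """Predict the majority label among the k training rows nearest to the query."""
    # No model is fitted: every prediction scans and sorts the whole training set.
    ranked = sorted(zip(train_x, train_y), key=lambda row: hamming(row[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

For continuous features, `hamming` would be swapped for a Euclidean distance; the rest of the procedure stays the same.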

Bagging: Bagging, also known as bootstrap aggregating, is an ensemble meta-learning algorithm
designed to enhance the performance of other machine learning algorithms in classification and
regression tasks. The process draws N bootstrap samples from the original training dataset, each
sampled with replacement and of the same size as the original, and then trains one classifier per
sample in N iterations, which can be parallelized. The final step aggregates the
outputs of the N classifiers through averaging or voting.
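The resample, train, and aggregate steps can be sketched as follows. This is an illustration under stated assumptions: rows are hypothetical `(feature, label)` pairs with a single numeric feature, and `fit_nearest` is a stand-in base learner chosen for brevity, not one prescribed by the surveyed work.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so the sample matches the original size."""
    return [rng.choice(rows) for _ in rows]


def fit_nearest(rows):
    """Hypothetical stand-in base learner: 1-nearest-neighbor on one numeric feature."""
    return lambda x: min(rows, key=lambda row: abs(row[0] - x))[1]


def bagging_predict(rows, x, n_estimators: int = 5, seed: int = 0):
    """Train one base learner per bootstrap sample and aggregate by majority vote."""
    rng = random.Random(seed)
    models = [fit_nearest(bootstrap_sample(rows, rng)) for _ in range(n_estimators)]
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

Because each base learner depends only on its own sample, the list comprehension that builds `models` is the part that could run in parallel.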

Naive Bayes: Naive Bayes is a probabilistic algorithm grounded in Bayes' theorem,
known for its strong independence assumption. Bayes' theorem relates conditional probabilities,
and the naive Bayes classifier simplifies its application by assuming that features are
independent of one another given the class.

In recent years, researchers have increasingly employed hybrid classification approaches in
phishing website detection to achieve superior performance and reduced computational times
compared to single classifiers. Hybrid models typically build upon a primary learner, incorporating
additional algorithms for feature selection or optimizing initialization parameters, such as
hyperparameters for neural networks.

The rapid evolution of deep learning and the success of natural language processing (NLP) have
led to the development of various deep learning models for phishing detection. These models
extract information and sequential patterns from URL strings without relying on source code
features from webpage content. They require neither specialized cybersecurity knowledge nor
third-party services to capture characteristics.

Some widely used deep learning algorithms include:

CNN (Convolutional Neural Network): CNN is a feedforward deep learning algorithm
extensively used in image classification. Its architecture consists of multiple layers: an
input layer, hidden layers comprising convolutional, pooling, and fully connected layers, and an
output layer.

RNN (Recurrent Neural Network): RNN is a deep neural network equipped with an internal
memory function to handle sequences of varying lengths, making it particularly effective in text-
related tasks like text mining.

Algorithm                     Training Time Complexity               Interpretability  Training Data Size  Inputs
Support Vector Machine (SVM)  O(n^2)                                 Medium            Small               Structured data
k-Nearest Neighbors (k-NN)    O(knd), k = number of neighbors        Medium            Small               Structured data
Decision Tree                 O(nd log n)                            High              Small               Structured data
Random Forest                 O(knd log n), k = number of trees      Medium            Small               Structured data
Naive Bayes                   O(nd)                                  High              Small               Structured data
Deep Neural Networks          Compute the activation of all neurons  Low               Large               Structured data or text data

Table 2 summarizes these algorithms based on a common dataset, with computational complexities
expressed in Big O notation. Deep neural networks are hard to interpret because it is difficult
to understand the role of each neuron and the contribution of each input feature to
the model output. While traditional machine learning algorithms are user-friendly, deep neural
networks excel in handling text data, such as URL strings, but demand more extensive training
data for optimal performance.

3.3.4 Performance Evaluation

Performance is assessed during the testing phase, which customarily involves dividing the
original dataset into training data (typically 80%) and test data (20%). When
scrutinizing the classifier's performance on the testing dataset, four key statistical metrics are
employed:

True Positives (TP): The count of positive data points correctly identified by the classifier.

True Negatives (TN): The count of negative data points correctly identified by the classifier.

False Positives (FP): The count of negative data points erroneously labeled as positive by the
classifier.

False Negatives (FN): The count of positive data points mistakenly labeled as negative by the
model.

These metrics, encapsulated in Table 3, provide a comprehensive insight into how well the
classifier performs in accurately identifying positive and negative instances within the testing
dataset.
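The four counts can be tallied from true and predicted labels as sketched below. The function names are hypothetical, "phishing" is assumed to be the positive class, and the two derived ratios, accuracy and false-positive rate, follow their standard definitions.

```python
def confusion_counts(y_true, y_pred, positive="phishing"):
    """Return (TP, TN, FP, FN) for a binary task with the given positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # phishing correctly flagged
            else:
                fp += 1  # legitimate site wrongly flagged
        elif truth == positive:
            fn += 1  # phishing site missed
        else:
            tn += 1  # legitimate site correctly passed
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def false_positive_rate(tp, tn, fp, fn):
    """Share of legitimate sites that are wrongly flagged as phishing."""
    return fp / (fp + tn)
```

Reporting the false-positive rate alongside accuracy matters in this setting, because a detector that raises many false warnings on legitimate sites is impractical even when its accuracy is high.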
