Detecting Phishing Websites
Gayatri Vidya Parishad College of Engineering (Autonomous)
Visakhapatnam
CERTIFICATE
This report on
“Detecting Phishing Websites”
is a bonafide record of the mini project work submitted
By
in their VII semester in partial fulfilment of the requirements for the award of the degree of
Bachelor of Technology in
Computer Science and Engineering
during the academic year 2022-2023
Declaration of the Students
We, the undersigned, solemnly declare that this report is based on the mini project work carried out by us during our study (2019-23) in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering.
We further declare that the proofs submitted are genuine to the best of our knowledge.
P. Triveni
M. Amarnadh Reddy
R. Sruthi
P. Yogesh
ACKNOWLEDGEMENT
We would like to express our deep sense of gratitude to our esteemed institute, Gayatri Vidya Parishad College of Engineering (Autonomous), which has provided us with an opportunity to fulfil our cherished desire.
We express our profound gratitude and deep indebtedness to our guide Mrs. G. Vani, Assistant Professor, Department of Computer Science and Engineering, whose valuable suggestions, guidance and comprehensive assistance helped us greatly in realising our present project.
We express our sincere thanks to our Principal, Dr. A. B. Koteswara Rao, Gayatri Vidya Parishad College of Engineering (Autonomous), for his encouragement during this project and for giving us a chance to explore and learn in the form of a mini project.
Finally, we would also like to thank all the members of the teaching and non-teaching staff of the Computer Science and Engineering Department for all their support in the completion of our project.
POPURI TRIVENI
ROKALLA SRUTHI
PANGA YOGESH
ABSTRACT
In recent years, as the use of mobile devices has increased, there has been a growing tendency to move almost all real-world operations to the cyber world. This makes our daily lives easier, but because of the anonymous structure of the Internet, it also introduces many security gaps. Commonly used antivirus programs and firewall systems can prevent most attacks. However, experienced attackers exploit the weaknesses of computer users by attempting to phish them with fake websites. Scam websites appear to be genuine, carrying the logos and graphics of the sites they imitate. Phishing attacks are used to steal sensitive information such as user IDs, passwords, bank account details and credit card numbers. These pages imitate popular banks, social media sites, e-commerce platforms and more. This project aims to detect fraudulent or phishing websites using deep learning algorithms. The phishing problem is huge, and no single solution can effectively minimise all vulnerabilities, so multiple methods are used. One of the approaches uses shape-based analysis to authenticate a website; we also use deep learning techniques and algorithms to evaluate different features of URLs and websites.
INDEX
CHAPTER 1. INTRODUCTION
1.1 Objective
1.2 About the Algorithm
1.3 Purpose
1.4 Scope
5.3.2 Model
CHAPTER 7. DEVELOPMENT
7.1 Datasets used
7.2 Sample Code
7.3 Results
CHAPTER 8. TESTING
8.1 Introduction to testing
8.2 Test Code
8.3 Test Cases
CHAPTER 9. CONCLUSION
CHAPTER-1
INTRODUCTION
Phishing is a fraudulent technique that uses social and technological tricks to steal customers' identification and financial credentials. Phishers use spoofed emails that appear to come from legitimate companies and agencies to lead users to fake websites where they divulge financial details such as usernames and passwords. Hackers also install malicious software on computers to steal credentials, often using systems that intercept the usernames and passwords of consumers' online accounts. Phishers use multiple channels, including email, Uniform Resource Locators (URLs), instant messages, forum postings, telephone calls, and text messages, to steal user information. The structure of phishing content is similar to the original content and tricks users into accessing the content in order to obtain their sensitive data. The primary objective of phishing is to gain personal information for financial gain or identity theft. Phishing attacks cause severe economic damage around the world. Moreover, most phishing attacks target financial/payment institutions and webmail, according to the Anti-Phishing Working Group's (APWG) latest phishing trend studies.
To obtain confidential data, criminals develop unauthorised replicas of a real website and its email, typically from a financial institution or another organisation dealing with financial data. The email is rendered using a legitimate company's logos and slogans. The design and structure of HTML allow copying of images or an entire website; this is one of the factors behind the rapid growth of the Internet as a communication medium, but it also enables the misuse of brands, trademarks and other company identifiers that customers rely on as authentication mechanisms. To trap users, phishers send "spoofed" emails to as many people as possible. When these emails are opened, customers tend to be diverted from the legitimate entity to a spoofed website.
1.1 PROJECT OBJECTIVE:
A phishing website (sometimes called a "spoofed" site or spam URL) tries to steal your account password or other confidential information by tricking you into believing you are on a legitimate website. You could even land on a phishing site by mistyping a URL (web address). The URLs of phishing websites usually differ from those of ordinary websites in many aspects: the URL length of phishing websites is usually long; the URLs contain unusual symbols such as '@'; and the URLs have a relatively large number of dots (.). This project aims at classifying the phishing websites data set by using decision tree classification.
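To make these URL heuristics concrete, the following is a minimal sketch that extracts the three flags described above from a raw URL string. The function name and the exact thresholds (54 and 75 characters, more than three dots) follow the feature encoding used in Chapter 7; the example URL is made up.

def url_heuristics(url):
    # URL length: -1 if short (< 54), 0 if medium (54-75), 1 if long (> 75)
    if len(url) < 54:
        length_flag = -1
    elif len(url) <= 75:
        length_flag = 0
    else:
        length_flag = 1
    # '@' symbol: browsers ignore everything before '@', a classic phishing trick
    at_flag = 1 if '@' in url else 0
    # Dots: -1 for up to 2 dots, 0 for exactly 3, 1 for more (nested subdomains)
    dots = url.count('.')
    dots_flag = -1 if dots <= 2 else (0 if dots == 3 else 1)
    return [length_flag, at_flag, dots_flag]

print(url_heuristics("http://paypal.com.secure-login.example.com/@verify"))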
1.2 ABOUT THE ALGORITHM:
Figure 1.2.1.1 Multilayer Perceptron
The Multilayer Perceptron falls under the category of feedforward algorithms: inputs are combined with the initial weights in a weighted sum and subjected to the activation function, just as in the Perceptron. The difference is that each linear combination is propagated to the next layer. Each layer feeds the next one with the result of its computation, its internal representation of the data. This goes all the way through the hidden layers to the output layer. But there is more to it. If the algorithm only computed the weighted sums in each neuron, propagated the results to the output layer, and stopped there, it would not be able to learn the weights that minimise the cost function. If the algorithm only computed one iteration, there would be no actual learning.
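As a minimal NumPy sketch of the forward pass just described (the layer sizes and random seed are illustrative, not the project's configuration):

import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.random(13)                          # one sample with 13 features, as in this project
W1, b1 = rng.random((8, 13)), np.zeros(8)   # hidden layer: 13 inputs -> 8 neurons
W2, b2 = rng.random((1, 8)), np.zeros(1)    # output layer: 8 inputs -> 1 neuron
h = relu(W1 @ x + b1)                       # hidden representation, propagated forward
y = sigmoid(W2 @ h + b2)                    # prediction; training would adjust W and b
print(y)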
1.3 VISION/PURPOSE:
Phishing is a form of fraud in which the attacker tries to obtain sensitive information such as login credentials or account information by posing as a reputable entity or person in email or other communication channels. The purpose of phishing domain detection is to detect phishing domain names. Passive queries related to the domain name that we want to classify as phishing or not therefore provide useful information to us. The project aims at developing a tool for the detection of phishing websites using deep learning algorithms, with all the above-mentioned advantages.
MISSION:
This tool is developed using Python along with the layout toolkit PyQt, PyUIC and Python's sklearn. The project uses the 'sklearn' module of Python to perform Multi Layer Perceptron classification and decision tree analysis. The Multi Layer Perceptron Classifier relies on an underlying neural network to perform the task of classification. A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. A Multilayer Perceptron Classifier is characterised by its input-layer features, number of hidden layers, training-set classes, number of training iterations, learning rate and activation function. A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches. Decision trees can handle both categorical and numerical data.
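As a sketch of how these two classifiers are configured in sklearn; the hyperparameter values below (hidden layer sizes, learning rate, iteration budget, tree depth) are illustrative choices, not the project's tuned settings:

from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# MLP: two hidden layers, ReLU activation, a fixed learning rate and iteration budget
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation='relu',
                    learning_rate_init=0.001, max_iter=500)

# Decision tree: depth capped to keep the tree of decision and leaf nodes readable
dt = DecisionTreeClassifier(max_depth=5)

# Both expose the same fit/predict interface used throughout this project:
# mlp.fit(X_train, y_train); mlp.predict(X_test)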
CHAPTER-2
SRS DOCUMENT
Functional Requirements:
A functional requirement defines a function of a system, its behaviour, and outputs. It can be a calculation, data manipulation, business process, user interaction, or any other specific functionality which defines what function a system is likely to perform. Functional requirements are also called functional specifications.
➢ The project should be useful in classifying the phishing websites data set using decision tree classification and Multi Layer Perceptron classification.
➢ The project should be useful to newcomers to social media, protecting them against phishing URLs.
➢ The project should ultimately lead to an improvement in the quality of the world wide web.
➢ To use the latest Decision Tree and Multi Layer Perceptron modules of the sklearn tool.
➢ To use Python, which is rated as a leading programming language by the programming community; more functionality can be implemented with fewer lines of code in Python.
➢ To use the PyQt tool to create the graphical user interfaces.
➢ All the front-end code is to be generated automatically by PyUIC.
2. Python Qt Designer for designing the user interface.
4. Pyuic for converting the designed user interface (UI) layout to Python code.
CHAPTER-3
ANALYSIS
To evaluate the performance of the feature set, it has been trained and cross-validated against many different parameter combinations. In the multilayer feed-forward network, we must gather data based on the feature sets and then tune the parameters to achieve maximum accuracy in phishing site classification.
This is an essential process in which the training networks' parameters must be set and validated across appropriate values. After attaining the right values, phishing sites can easily be classified with the highest probability.
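One standard way to carry out this kind of parameter tuning is cross-validated grid search. The sketch below uses sklearn's GridSearchCV over an illustrative parameter grid; the actual combinations searched in this project are not listed in the report, and the deep model itself was implemented with TensorFlow, as noted next.

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Candidate parameter combinations to cross-validate (illustrative values)
param_grid = {
    'hidden_layer_sizes': [(32,), (64, 32), (64, 64, 32)],
    'learning_rate_init': [0.01, 0.001],
    'activation': ['relu', 'logistic'],
}
search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=5,
                      scoring='accuracy')
# search.fit(training_inputs, training_outputs) selects the combination with the
# best cross-validated accuracy, available afterwards via search.best_params_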
We used the Python programming language along with the TensorFlow library to implement the deep learning algorithms.
Each layer is composed of a basic computing unit, the neuron. The neuron is inspired by the biological neuron and performs mathematical functions for the storage of information. This information is transmitted to other neurons, and thus information propagates through the neural network. A neuron's general mathematical representation is
Y = F(W * X + B),
where X is the input vector, W the weights, B the bias and F the activation function.
The number of neurons in the input layer depends upon the dimension of the dataset, or equivalently the number of features k, i.e., X ∈ R^k.
Let l ∈ {0, 1, 2, 3, 4, 5, 6} index the layers of our deep learning model. For layers {1, ..., 6}, let Y^(l-1) be the input to layer l, Y^(l) the output of layer l, W^(l) the weight matrix of layer l used for the linear transformation of its n inputs to its m outputs, B^(l) the bias of layer l, and F^(l) the associated activation function of each layer. Y^(0) is the input layer and Y^(6) is the output layer. Each layer computes
Y^(l) = F^(l)(W^(l) * Y^(l-1) + B^(l)),
where * is matrix multiplication. The W values were initialised with Xavier initialisation, and B was initialised with zero; W and B are updated after each iteration of the backpropagation method. Layer 0 is the input layer, layer 6 is the output layer, and layers 1-5 are hidden layers activated with the ReLU function:
Y_i^(l) = max(0, W^(l) * Y_i^(l-1) + B^(l)),
where i represents the ith iteration and l represents the lth layer. The intermediate output of our model, Y*, is obtained using the sigmoid activation function:
Y* = 1 / (1 + e^(-(W^(l) * Y^(l-1) + B^(l)))),
where l = 6 in the case of the output layer. The loss function L(Y*, Y) over the entire dataset is defined as the sum of the cross-entropy between the model output and the actual output:
L(Y*, Y) = - Σ_j [ Y_j log(Y*_j) + (1 - Y_j) log(1 - Y*_j) ],
where Y* is the intermediate output for the entire dataset obtained after processing it through the deep learning model, Y*_j ∈ (0, 1) is the jth row of Y*, Y is the actual label vector of our dataset, and Y_j ∈ {0, 1} is the jth row of Y, where 0 represents a legitimate site and 1 indicates a phishing site.
This loss function is optimised with the Adam optimiser at every epoch to update the parameters and train the deep neural model using the backpropagation algorithm. The functional formations represent these features without overfitting, because the DNN has 5 hidden layers along with 1 input layer and 1 output layer.
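Since the text states the model was implemented with TensorFlow, the following is a minimal sketch of the architecture described above: five ReLU hidden layers between the input and a sigmoid output, Xavier (Glorot) initialisation, cross-entropy loss and the Adam optimiser. The layer widths and the 13-feature input are assumptions made for illustration, as the report does not list them.

import tensorflow as tf

# Five ReLU hidden layers between the input and the sigmoid output, as described above
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(13,)),  # one input neuron per dataset feature
    tf.keras.layers.Dense(64, activation='relu', kernel_initializer='glorot_uniform'),
    tf.keras.layers.Dense(64, activation='relu', kernel_initializer='glorot_uniform'),
    tf.keras.layers.Dense(32, activation='relu', kernel_initializer='glorot_uniform'),
    tf.keras.layers.Dense(32, activation='relu', kernel_initializer='glorot_uniform'),
    tf.keras.layers.Dense(16, activation='relu', kernel_initializer='glorot_uniform'),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # Y* in (0, 1): phishing probability
])
# Cross-entropy loss optimised with Adam at every epoch, as in the text;
# 'glorot_uniform' is Keras's name for Xavier initialisation
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(training_inputs, training_outputs, epochs=50) would train by backpropagation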
❖ The MLP is a special case of a feedforward neural network in which every layer is a fully connected layer. The MLP concept is used both in a generic, loose form, meaning any feedforward ANN, and more precisely for networks of multiple layers of perceptrons.
❖ An MLP consists of sequential layers of function compositions: the raw data enter from the input layer, which then generates the input for the next layer.
❖ The output of the hidden layer becomes the input for the output layer, which applies the final function. Each layer consists of a set of nodes or 'neurons'; a node receives the input from the previous layer and applies an activation function. The activation function is the identity function for linear regression and the logistic or sigmoid function for logistic regression.
❖ By expanding the MLP network in depth and width, the flexibility of the function increases. In our experiments, we try to increase the depth and width to find the best network structure and improve the proposed model's performance.
The main aim of the feasibility study activity is to determine whether the product is technically and financially feasible to develop. A feasibility study should provide management with enough information to decide whether the project should be pursued.
There are various types of feasibility that can be determined. They are:
Operational - Defines the urgency of the problem and the acceptability of any solution. It includes people-oriented and social issues: internal issues, such as manpower problems, labour objections, manager resistance, organisational conflicts, and policies; and external issues, including social acceptability, legal aspects, and government regulations.
Technical - Is the feasibility within the limits of current technology? Does the technology exist at all? Is it available within the given resources?
Economic - Is the project possible, given resource constraints? Are the benefits that will accrue from the new system worth the costs? What are the savings that will result from the system, both tangible and intangible? What are the development and operational costs?
Schedule - Constraints on the project schedule and whether they can reasonably be met.
3.3.1 Economic Feasibility: This evaluation helps the decision-makers decide whether the planned scheme should be taken up later or now, depending on the financial condition of the organisation. It also studies the cost benefits of the proposed scheme. Economic feasibility also performs the following tasks.
3.3.2 Technical Feasibility: A large part of determining resources has to do with assessing technical feasibility. It considers the technical requirements of the proposed project. The technical requirements are then compared to the technical capability of the organisation. The systems project is considered technically feasible if the internal technical capability is sufficient to support the project requirements.
The analyst must find out whether current technical resources can be upgraded or added to in a manner that fulfils the request under consideration. This is where the expertise of system analysts is beneficial, since, using their own experience and their contact with vendors, they will be able to answer the question of technical feasibility. Technical feasibility also performs the following tasks:
o Is the technology available within the given resource constraints?
o Are the required technical resources available?
3.3.3 Operational Feasibility: Operational feasibility refers to the availability of the operational resources needed to extend research results beyond the setting in which they were developed, and for which all the operational requirements are minimal and easily accommodated. In addition, operational feasibility includes any rational compromises users make in adjusting the technology to the limited operational resources available to them. Operational feasibility also performs tasks such as:
o Does the current mode of operation provide adequate response time?
o Does the current mode of operation make maximum use of resources?
Our project operates on a standard processor, and the installed packages are supported by the system.
● The cost of the hardware and software for the class of application being considered.
● The benefits in the form of reduced costs.
● The proposed system will give minute information as a result.
● Performance is improved, which in turn may be expected to provide increased profits.
● This feasibility checks whether the system can be developed with the available funds.
● This can be done economically if planned judiciously, so the project is economically feasible.
CHAPTER-4
SOFTWARE DESCRIPTION
4.1 Anaconda Navigator:
Anaconda is a free, open-source distribution of the Python and R programming languages for machine learning and data science projects; it is therefore known as a professional data science platform. It contains a powerful environment manager, which provides different Python environments and tools such as Spyder, Jupyter Notebook, and so on. Navigator can search for packages on Anaconda Cloud or in a local Anaconda Repository. It is available for Windows, macOS, and Linux.
4.2 Scikit-Learn:
Scikit-learn is a free machine learning library for Python. It supports both supervised and unsupervised machine learning, providing diverse algorithms for classification, regression, clustering, and dimensionality reduction.
The library is built on libraries you may already be familiar with, such as NumPy and SciPy, and it also plays well with other libraries such as Pandas and Seaborn.
The library provides access to many different datasets, one of which is the famous iris dataset. That dataset is so famous that it is often referred to as the "hello world" of machine learning.
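For instance, a minimal sketch of loading that bundled dataset and fitting one of the classifiers used in this project:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()                   # bundled "hello world" dataset
clf = DecisionTreeClassifier().fit(iris.data, iris.target)
print(clf.predict(iris.data[:3]))    # predictions for the first three samples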
4.3 Pandas:
The name "Pandas" refers both to "Panel Data" and to "Python Data Analysis"; the library was created by Wes McKinney in 2008.
Pandas is a Python library used for working with data sets. It has functions for analysing, cleaning, exploring, and manipulating data. Pandas allows us to analyse big data and draw conclusions based on statistical theories, and it can clean messy data sets and make them readable and relevant.
4.4 Qt Designer:
Qt Designer is a Qt tool that provides you with a what-you-see-is-what-you-get
(WYSIWYG) user interface to create GUIs for your PyQt applications productively and
efficiently. The PyQt installer comes with a GUI builder tool called Qt Designer.
Using its simple drag and drop interface, a GUI interface can be quickly built without having
to write the code.
It is, however, not an IDE such as Visual Studio. Hence, Qt Designer does not have the
facility to debug and build the application.
4.5 Matplotlib:
Matplotlib is a plotting library for Python and an excellent visualisation library for 2D plots of arrays. It is a multi-platform data visualisation library built on NumPy arrays and designed to work with the broader SciPy stack.
Matplotlib comes with a wide variety of plots, such as line, bar and scatter plots. Plots help us understand trends and patterns and make correlations. pyplot is the most important module in the matplotlib library and is used to plot 2D data.
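A minimal pyplot sketch with made-up values:

import matplotlib.pyplot as plt

# Line plot of illustrative accuracy values over training epochs
plt.plot([1, 2, 3, 4], [0.71, 0.85, 0.91, 0.93])
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Example 2D line plot')
plt.show()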
4.6 Numpy:
● NumPy stands for Numerical Python. It is a Python library used for working with arrays, and it also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
● It is optimised to work with the latest CPU architectures. Unlike lists, NumPy arrays are stored in one continuous place in memory, so processes can access and manipulate them very efficiently.
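A short sketch of the vectorised style this enables:

import numpy as np

a = np.array([[1, 2], [3, 4]])
print(a * 10)            # elementwise arithmetic over the contiguous buffer, no Python loop
print(np.linalg.det(a))  # linear-algebra routines operate on the same arrays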
4.7 Streamlit:
Streamlit is an open-source, Python-based framework for developing and deploying interactive data science dashboards and machine learning models.
It is built on top of Python and supports many mainstream Python libraries such as matplotlib, plotly and pandas.
Streamlit makes it easy to visualise, mutate, and share data.
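A minimal sketch of a Streamlit app (run with streamlit run app.py), separate from the full application listed in Chapter 7:

import streamlit as st
import pandas as pd

st.title('Hello, Streamlit')                        # page title widget
value = st.number_input('Pick a number', 0, 10, 5)  # interactive input widget
st.write(pd.DataFrame({'squared': [value ** 2]}))   # reactive output: reruns on input change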
CHAPTER-5
PROJECT DESCRIPTION
The project aims at developing a tool for the detection of phishing websites using deep learning algorithms.
Mission:
This tool is developed using Python along with the layout toolkit PyQt, PyUIC and Python's sklearn. We use a multilayer perceptron to detect phishing websites.
1. Collection of datasets.
Qt Designer is the tool used to create the graphical user interfaces in this project. The tool can be found in the Library/Bin folder of the Anaconda installation, as shown in the following screenshot.
Fig 5.3.1.1 Qt designer window
● As this designer tool is used repeatedly in the project, it is better to create a desktop shortcut by right-clicking on designer.exe, as shown in the following figure.
5.3.2 MODEL
Deep learning
Deep learning, also called deep structured learning, is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. Deep learning is an increasingly popular subset of machine learning. Deep learning models are built using neural networks. A neural network takes in inputs, which are then processed in hidden layers using weights that are adjusted during training; the model then produces a prediction. The weights are adjusted to find patterns in order to make better predictions. In deep learning, a computer model learns to perform classification tasks directly from images, text, or sound. Deep learning models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance. Models are trained using large sets of labelled data. Deep learning architectures such as deep neural networks have been applied to fields including computer vision, speech recognition, natural language processing (NLP) and audio recognition.
CHAPTER-6
SYSTEM DESIGN
6.1 Introduction to UML
Unified Modelling Language (UML) is a general-purpose modelling language. The
main aim of UML is to define a standard way to visualise the way a system has been
designed. It is quite like blueprints used in other fields of engineering. UML is not a
programming language, it is rather a visual language. We use UML diagrams to
portray the behaviour and structure of a system. UML helps software engineers,
businessmen and system architects with modelling, design and analysis. The Object
Management Group (OMG) adopted Unified Modelling Language as a standard in
1997. It's been managed by OMG ever since. The International Organisation for
Standardisation (ISO) published UML as an approved standard in 2005. UML has been
revised over the years and is reviewed periodically.
Fig 6.1.1 flow chart
● These things are the basic object-oriented building blocks of the UML.
You use them to write well-formed models.
Structural Things
Structural things are the nouns of UML models. These are the mostly static parts of a model,
representing elements that are either conceptual or physical. Collectively, the structural things
are called classifiers.
A class is a description of a set of objects that share the same attributes, operations, relationships, and semantics. A class implements one or more interfaces. Graphically, a class is rendered as a rectangle, usually including its name, attributes, and operations.
● Class - A class is a set of identical things that outlines the functionality and properties of an object. It can also represent an abstract class whose functionalities are not defined. Its notation is as follows.
Behavioural Things
● Behavioural things are the dynamic parts of UML models. These are the verbs of a model, representing behaviour over time and space. There are two primary kinds of behavioural things:
1. Interaction
2. State machine
1) Interaction
● It is a behaviour that comprises a set of messages exchanged among a set of objects or roles within a particular context to accomplish a specific purpose.
● The behaviour of a society of objects or of an individual operation may be specified with an interaction.
● An interaction involves a number of other elements, including messages, actions, and connectors (the connections between objects).
● Graphically, a message is rendered as a directed line, almost always including the name of its operation.
2) State machine
● A state machine is a behaviour that specifies the sequences of states an object or an interaction goes through during its lifetime in response to events.
● Graphically, a state is rendered as a rounded rectangle, usually including its name and its substates.
Grouping Things
Annotational Things
● Annotational things are the explanatory parts of UML models. These are
the comments you may apply to describe, illuminate, and remark about any
element in a model.
● There is one primary kind of annotational thing, called a note. A note is
simply a symbol for rendering constraints and comments attached to an element or
a collection of elements.
2. Association
● Association is basically a set of links that connects the elements of a
UML model.
● It also describes how many objects are taking part in that relationship.
3. Generalisation
● It is a specialisation/generalisation relationship in which the specialised element (the child) builds on the specification of the generalised element (the parent).
● The child shares the structure and the behaviour of the parent. Graphically, a generalisation relationship is rendered as a solid line with a hollow arrowhead pointing to the parent.
4. Realisation
● Realisation can be defined as a relationship in which two elements are connected.
● One element describes some responsibility which is not implemented, and the other one implements it.
● This relationship exists in the case of interfaces.
➢ Class diagram
➢ Object diagram
➢ Component diagram
➢ Composite structure diagram
➢ Use case diagram
➢ Sequence diagram
➢ Communication diagram
➢ State diagram
➢ Activity diagram
The system takes the URL features and website characteristics as input, followed by the calculation of the accuracies of Multi Layer Perceptron classification and Decision Tree analysis.
ACTIVITY DIAGRAM
Fig 6.3.3: Activity diagram
An activity diagram is another important diagram in UML for describing the dynamic aspects of a system. An activity diagram is basically a flowchart representing the flow from one activity to another; an activity can be described as an operation of the system, so the control flow is drawn from one operation to the next. In this activity diagram we can see that we first provide the dataset with the URL features and website characteristics, then split the dataset into training and test sets, and finally calculate the Multi Layer Perceptron classification accuracy and the Decision Tree analysis accuracy along with the predictions.
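The flow just described maps onto a few lines of sklearn. The sketch below is self-contained with stand-in random data, and uses train_test_split instead of the fixed 1800-row cut used by the scripts in Chapter 7:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
inputs = rng.integers(0, 2, size=(200, 13))   # stand-in for the 13 features of Chapter 7
outputs = rng.integers(0, 3, size=200)        # stand-in labels: 0 low, 1 medium, 2 high

# Split the dataset into training and test sets, then compute both accuracies
X_train, X_test, y_train, y_test = train_test_split(inputs, outputs, test_size=0.2)
for model in (MLPClassifier(), DecisionTreeClassifier()):
    model.fit(X_train, y_train)
    acc = 100.0 * accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, "accuracy:", acc)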
CHAPTER-7
DEVELOPMENT
7.2 SAMPLE CODE
Phisdet1.py
import sys
import os
from PyQt5 import QtWidgets

# Main window: each button handler launches one of the data-entry forms below.
# (The PyUIC-generated UI setup inside MyForm is omitted in this listing.)
class MyForm(QtWidgets.QMainWindow):
    def wschar(self):
        os.system("python sitechar1.py")   # open the website-characteristics form

    def urlprp(self):
        os.system("python urlprop1.py")    # open the URL-properties form

if __name__ == "__main__":
    app = QtWidgets.QApplication(sys.argv)
    myapp = MyForm()
    myapp.show()
    sys.exit(app.exec_())
Urlprop1.py
import sys import os from urlprop import *
from PyQt5 import QtWidgets, QtGui,
QtCore import sqlite3 con =
sqlite3.connect('phisdet1')
def insertvalues(self):
with con:
cur = con.cursor() uid =
str(self.ui.lineEdit_9.text()) s1 =
str(self.ui.lineEdit_3.text()) s2 =
str(self.ui.lineEdit_4.text()) s3 =
str(self.ui.lineEdit_5.text()) s4 =
str(self.ui.lineEdit_6.text()) s5 =
str(self.ui.lineEdit_2.text()) s6 =
str(self.ui.lineEdit_7.text()) s7 =
str(self.ui.lineEdit_8.text()) s8 =
str(self.ui.lineEdit_10.text())
cur.execute('INSERT INTO
urlpr
ops(uid,s1,s2,s3,s4,s5,s6,s7,s8)
VALUES(?,?,?,?,?,?,?,?,?)',(uid,s1,s2,s3,s4,s5,s6,s7,s8)) con.commit()
if __name__ == "__main__":
app =
QtWidgets.QApplication(sys.argv)
myapp = MyForm() myapp.show()
sys.exit(app.exec_())
Sitechar1.py
#This program gets two values from a DB into lineEdits.
import sys
import os
from sitechar import *
from PyQt5 import QtWidgets, QtGui, QtCore
import sqlite3

con = sqlite3.connect('phisdet1')   # DB handle (implied by the 'with con:' below)

def insertvalues(self):
    # Store the five website characteristics entered in the form
    with con:
        cur = con.cursor()
        uid = str(self.ui.lineEdit_9.text())
        s1 = str(self.ui.lineEdit_4.text())
        s2 = str(self.ui.lineEdit_5.text())
        s3 = str(self.ui.lineEdit_6.text())
        s4 = str(self.ui.lineEdit_7.text())
        s5 = str(self.ui.lineEdit_8.text())
        cur.execute('INSERT INTO sitechars(uid,s1,s2,s3,s4,s5) '
                    'VALUES(?,?,?,?,?,?)',
                    (uid, s1, s2, s3, s4, s5))
        con.commit()

if __name__ == "__main__":
    app = QtWidgets.QApplication(sys.argv)
    myapp = MyForm()
    myapp.show()
    sys.exit(app.exec_())
Datasubset1.py
# This plot is generated by considering a subset of 500 rows, and the first and
# last columns from the dataset.
import numpy as np
import matplotlib.pyplot as plt
import math
import pandas as pd

np.random.seed(6)
df = pd.read_csv('phisset.csv')
df["Website_Char1"] = df["Website_Char1"].map({'MysteriousLinks': 1, 'NoMysteriousLinks': 0})
df["Phis_Probability"] = df["Phis_Probability"].map({'Medium': 1, 'Low': 0, 'High': 2})
cnt1 = (df["Website_Char1"] == 1).sum()     # number of '1's in first column
cnt2 = (df["Website_Char1"] == 0).sum()     # number of '0's in first column
cnt3 = (df["Phis_Probability"] == 0).sum()  # number of '0's in last column
cnt4 = (df["Phis_Probability"] == 1).sum()  # number of '1's in last column
cnt5 = (df["Phis_Probability"] == 2).sum()  # number of '2's in last column
Mlpc1.py
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv('phisset.csv')
# Map every categorical feature onto numeric codes
df["Website_Char1"] = df["Website_Char1"].map({'MysteriousLinks': 1, 'NoMysteriousLinks': 0})
df["Website_Char2"] = df["Website_Char2"].map({'NonWorkingLinks': 1, 'WorkingLinks': 0})
df["Website_Char3"] = df["Website_Char3"].map({'LessThanOneYear': 1, 'MoreThanOneYear': 0})
df["Website_Char4"] = df["Website_Char4"].map({'DiffDomainFavicon': 1, 'SameDomainFavicon': 0})
df["Website_Char5"] = df["Website_Char5"].map({'SlowDownPerf': 1, 'NoSlowDownPerf': 0})
df["Url_Prop1"] = df["Url_Prop1"].map({'LessThan54': -1, '54to75': 0, 'MoreThan75': 1})
df["Url_Prop2"] = df["Url_Prop2"].map({'DifferentPositionedSlash': 1, 'CorrectPositionedSlash': 0})
df["Url_Prop3"] = df["Url_Prop3"].map({'IPAddressPart': 1, 'NoIPAddressPart': 0})
df["Url_Prop4"] = df["Url_Prop4"].map({'ShortendURL': 1, 'NoShortendURL': 0})
df["Url_Prop5"] = df["Url_Prop5"].map({'@symbol': 1, 'No@symbol': 0})
df["Url_Prop6"] = df["Url_Prop6"].map({'DashPresent': 1, 'DashNotPresent': 0})
df["Url_Prop7"] = df["Url_Prop7"].map({'2dots': -1, '3dots': 0, 'MoreThan3dots': 1})
df["Url_Prop8"] = df["Url_Prop8"].map({'HTTPSNotAdded': 1, 'HTTPSAdded': 0})
df["Phis_Probability"] = df["Phis_Probability"].map({'Medium': 1, 'Low': 0, 'High': 2})

data = df[["Website_Char1", "Website_Char2", "Website_Char3", "Website_Char4",
           "Website_Char5", "Url_Prop1", "Url_Prop2", "Url_Prop3", "Url_Prop4",
           "Url_Prop5", "Url_Prop6", "Url_Prop7", "Url_Prop8",
           "Phis_Probability"]].to_numpy()
inputs = data[:, :-1]             # 13 feature columns
outputs = data[:, -1]             # phishing-probability label
training_inputs = inputs[:1800]
training_outputs = outputs[:1800]
testing_inputs = inputs[1800:]
testing_outputs = outputs[1800:]

classifier = MLPClassifier()
classifier.fit(training_inputs, training_outputs)
predictions = classifier.predict(testing_inputs)
accuracy = 100.0 * accuracy_score(testing_outputs, predictions)
print("The accuracy of MLPC on testing data is: " + str(accuracy))

testSet = [[1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('MLPC Prediction on the first test set is:', predictions)

testSet = [[1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('MLPC Prediction on the second test set is:', predictions)

testSet = [[0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('MLPC Prediction on the third test set is:', predictions)

print("Note: 0 indicates Low probable Phishing Site, 1 indicates Medium "
      "probable Phishing Site, 2 indicates High probable Phishing Site")
Dt1.py
from sklearn import tree
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

df = pd.read_csv('phisset.csv')
df["Website_Char1"] = df["Website_Char1"].map({'MysteriousLinks': 1, 'NoMysteriousLinks': 0})
df["Website_Char2"] = df["Website_Char2"].map({'NonWorkingLinks': 1, 'WorkingLinks': 0})
df["Website_Char3"] = df["Website_Char3"].map({'LessThanOneYear': 1, 'MoreThanOneYear': 0})
df["Website_Char4"] = df["Website_Char4"].map({'DiffDomainFavicon': 1, 'SameDomainFavicon': 0})
df["Website_Char5"] = df["Website_Char5"].map({'SlowDownPerf': 1, 'NoSlowDownPerf': 0})
df["Url_Prop1"] = df["Url_Prop1"].map({'LessThan54': -1, '54to75': 0, 'MoreThan75': 1})
df["Url_Prop2"] = df["Url_Prop2"].map({'DifferentPositionedSlash': 1, 'CorrectPositionedSlash': 0})
df["Url_Prop3"] = df["Url_Prop3"].map({'IPAddressPart': 1, 'NoIPAddressPart': 0})
df["Url_Prop4"] = df["Url_Prop4"].map({'ShortendURL': 1, 'NoShortendURL': 0})
df["Url_Prop5"] = df["Url_Prop5"].map({'@symbol': 1, 'No@symbol': 0})
df["Url_Prop6"] = df["Url_Prop6"].map({'DashPresent': 1, 'DashNotPresent': 0})
df["Url_Prop7"] = df["Url_Prop7"].map({'2dots': -1, '3dots': 0, 'MoreThan3dots': 1})
df["Url_Prop8"] = df["Url_Prop8"].map({'HTTPSNotAdded': 1, 'HTTPSAdded': 0})
df["Phis_Probability"] = df["Phis_Probability"].map({'Medium': 1, 'Low': 0, 'High': 2})

data = df[["Website_Char1", "Website_Char2", "Website_Char3", "Website_Char4",
           "Website_Char5", "Url_Prop1", "Url_Prop2", "Url_Prop3", "Url_Prop4",
           "Url_Prop5", "Url_Prop6", "Url_Prop7", "Url_Prop8",
           "Phis_Probability"]].to_numpy()
inputs = data[:, :-1]
outputs = data[:, -1]
training_inputs = inputs[:1800]
training_outputs = outputs[:1800]
testing_inputs = inputs[1800:]
testing_outputs = outputs[1800:]

classifier = tree.DecisionTreeClassifier()
classifier.fit(training_inputs, training_outputs)
predictions = classifier.predict(testing_inputs)
accuracy = 100.0 * accuracy_score(testing_outputs, predictions)
print("The accuracy of Decision Tree on testing data is: " + str(accuracy))

testSet = [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('DT Prediction on the first test set is:', predictions)

testSet = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('DT Prediction on the second test set is:', predictions)

testSet = [[0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('DT Prediction on the third test set is:', predictions)

print("Note: 0 indicates Low probable Phishing Site, 1 indicates Medium "
      "probable Phishing Site, 2 indicates High probable Phishing Site")
Streamlit Code:
import streamlit as st
from sklearn.neural_network import MLPClassifier
import numpy
import pandas as pd
from sklearn.metrics import accuracy_score

df = pd.read_csv('phisset.csv')
df["Website_Char1"] = df["Website_Char1"].map({'MysteriousLinks': 1, 'NoMysteriousLinks': 0})
df["Website_Char2"] = df["Website_Char2"].map({'NonWorkingLinks': 1, 'WorkingLinks': 0})
df["Website_Char3"] = df["Website_Char3"].map({'LessThanOneYear': 1, 'MoreThanOneYear': 0})
df["Website_Char4"] = df["Website_Char4"].map({'DiffDomainFavicon': 1, 'SameDomainFavicon': 0})
df["Website_Char5"] = df["Website_Char5"].map({'SlowDownPerf': 1, 'NoSlowDownPerf': 0})
df["Url_Prop1"] = df["Url_Prop1"].map({'LessThan54': -1, '54to75': 0, 'MoreThan75': 1})
df["Url_Prop2"] = df["Url_Prop2"].map({'DifferentPositionedSlash': 1, 'CorrectPositionedSlash': 0})
df["Url_Prop3"] = df["Url_Prop3"].map({'IPAddressPart': 1, 'NoIPAddressPart': 0})
df["Url_Prop4"] = df["Url_Prop4"].map({'ShortendURL': 1, 'NoShortendURL': 0})
df["Url_Prop5"] = df["Url_Prop5"].map({'@symbol': 1, 'No@symbol': 0})
df["Url_Prop6"] = df["Url_Prop6"].map({'DashPresent': 1, 'DashNotPresent': 0})
df["Url_Prop7"] = df["Url_Prop7"].map({'2dots': -1, '3dots': 0, 'MoreThan3dots': 1})
df["Url_Prop8"] = df["Url_Prop8"].map({'HTTPSNotAdded': 1, 'HTTPSAdded': 0})
df["Phis_Probability"] = df["Phis_Probability"].map({'Medium': 1, 'Low': 0, 'High': 2})

data = df[["Website_Char1", "Website_Char2", "Website_Char3", "Website_Char4",
           "Website_Char5", "Url_Prop1", "Url_Prop2", "Url_Prop3", "Url_Prop4",
           "Url_Prop5", "Url_Prop6", "Url_Prop7", "Url_Prop8",
           "Phis_Probability"]].to_numpy()

# Train/test split as in Mlpc1.py (implied by the fit/predict calls below,
# which the original listing used without showing these definitions)
inputs = data[:, :-1]
outputs = data[:, -1]
training_inputs = inputs[:1800]
training_outputs = outputs[:1800]
testing_inputs = inputs[1800:]

web_char_1 = st.number_input("Website containing any mysterious links : Enter 0 for No, 1 for Yes", 0, 1, 0)
web_char_2 = st.number_input("Website containing any non-working links : Enter 0 for No, 1 for Yes", 0, 1, 0)
web_char_3 = st.number_input("Website existing for less than one year : Enter 0 for No, 1 for Yes", 0, 1, 0)
web_char_4 = st.number_input("Favicon loaded from a different domain : Enter 0 for No, 1 for Yes", 0, 1, 0)
web_char_5 = st.number_input("Website slows down the performance of the explorer : Enter 0 for No, 1 for Yes", 0, 1, 0)
URL_prop_1 = st.number_input("URL length : Enter -1 if length < 54; 0 if 54 <= length <= 75; 1 if length > 75", -1, 1, 0)
URL_prop_2 = st.number_input("URL containing a double slash at a different position : Enter 0 for No, 1 for Yes", 0, 1, 0)
URL_prop_3 = st.number_input("IP address is used as a part of the URL : Enter 0 for No, 1 for Yes", 0, 1, 0)
URL_prop_4 = st.number_input("Shortened URL : Enter 0 for No, 1 for Yes", 0, 1, 0)
URL_prop_5 = st.number_input("URL having @ symbol : Enter 0 for No, 1 for Yes", 0, 1, 0)
URL_prop_6 = st.number_input("Dash symbol is present in the URL : Enter 0 for No, 1 for Yes", 0, 1, 0)
URL_prop_7 = st.number_input("No. of dots in the URL : Enter -1 for 2 dots; 0 for 3 dots; 1 for more than 3", -1, 1, 0)
URL_prop_8 = st.number_input("HTTPS is not added on the domain part of the URL : Enter 0 for No, 1 for Yes", 0, 1, 0)

classifier = MLPClassifier()
classifier.fit(training_inputs, training_outputs)   # classifier learns from the data
predictions = classifier.predict(testing_inputs)

testdata = [web_char_1, web_char_2, web_char_3, web_char_4, web_char_5,
            URL_prop_1, URL_prop_2, URL_prop_3, URL_prop_4,
            URL_prop_5, URL_prop_6, URL_prop_7, URL_prop_8]
t_data = [testdata]
test_phish = pd.DataFrame(t_data)
predict = classifier.predict(test_phish)

if predict[0] == 0:
    st.subheader('As per the input given, the MLPC classifier classifies the website as a low phishing website')
elif predict[0] == 1:
    st.subheader('As per the information, the MLPC classifier classifies the website as a medium phishing website')
elif predict[0] == 2:
    st.subheader('As per the information, the MLPC classifier classifies the website as a high phishing website')
7.3 RESULTS
phisdet1.py
Sitechar1.py
Mlpc1.py and Dt1.py
Streamlit results:
CHAPTER-8
TESTING
8.1 INTRODUCTION TO TESTING
Software testing is the process of verifying that the actual results match the expected results and of ensuring that the software system is defect free. It involves the execution of a software component or system component to evaluate one or more properties of interest. It is required for evaluating the system. This phase is a critical phase of software quality assurance and presents the ultimate review of coding.
Importance of Testing
The importance of software testing cannot be overstated. This process is often skipped, and the product and business may suffer as a result. To understand the importance of testing, here are some of the key points:
➔ Software testing saves money
➔ Provides security
➔ Improves product quality
➔ Customer satisfaction
Testing can be done in different ways. The main idea behind testing is to reduce errors with minimal time and effort.
Benefits of Testing
Product Quality: Testing ensures that a quality product is delivered to customers.
Customer Satisfaction: The main aim of any product is to give satisfaction to its customers, and UI/UX testing ensures the best user experience.
Unit Testing: Unit tests are very low level and close to the source of your application. They consist of testing individual methods and functions of the classes, components or modules used by your software. Unit tests are in general quite cheap to automate and can be run very quickly by a continuous integration server.
Integration Testing: Integration tests verify that different modules or services used by your
application work well together. For example, it can be testing the interaction with the
database or making sure that microservices work together as expected. These types of tests
are more expensive to run as they require multiple parts of the application to be up and
running.
Functional Testing: Functional tests focus on the business requirements of an application. They only verify the output of an action and do not check the intermediate states of the system when performing that action. There is sometimes confusion between integration tests and functional tests, as they both require multiple components to interact with each other. The difference is that an integration test may simply verify that you can query the database, while a functional test would expect to get a specific value from the database as defined by the product requirements.
Regression Testing: Regression testing is a crucial stage for the product and very useful for developers to identify the stability of the product under changing requirements. Regression testing is done to verify that a code change in the software does not impact the existing functionality of the product.
System Testing: System testing of software or hardware is testing conducted on a complete, integrated system to evaluate its compliance with its specified requirements. System testing is a series of different tests whose primary purpose is to fully exercise the computer-based system.
Performance Testing: It checks the speed, response time, reliability, resource usage and scalability of a software program under the expected workload. The purpose of performance testing is not to find functional defects but to eliminate performance bottlenecks in the software or device.
Alpha Testing: This is a form of internal acceptance testing performed mainly by the in-house software QA and testing teams. Alpha testing is the last testing done by the test teams at the development site after acceptance testing and before releasing the software for the beta test. It can also be done by potential users or customers of the application, but it remains a form of in-house acceptance testing.
Beta Testing: This is the testing stage that follows the internal full alpha test cycle. It is the final testing phase, in which companies release the software to a few external user groups outside the company's test teams and employees. This initial software version is known as the beta version. Most companies gather user feedback on this release.
Black Box Testing: Also known as behavioural testing, this is a software testing method in which the internal structure, design and implementation of the item being tested are not known to the tester. These tests can be functional or non-functional, though usually functional.
Fig 8.1.1 Black box testing
This method is named so because the software program, in the eyes of the tester, is like a black box, inside which one cannot see. The method attempts to find errors in the following categories: incorrect or missing functions, interface errors, errors in data structures or external database access, behaviour or performance errors, and initialisation and termination errors.
White Box Testing: White box testing (also known as clear box testing, open box testing, glass box testing, transparent box testing, code-based testing or structural testing) is a software testing method in which the internal structure, design and implementation of the item being tested are known to the tester. The tester chooses inputs to exercise paths through the code and determines the appropriate outputs. Programming know-how and implementation knowledge are essential. White box testing goes beyond the user interface and into the nitty-gritty of a system. This method is named so because the software program, in the eyes of the tester, is like a white/transparent box, inside which one clearly sees.
Multi Layer Perceptron Testing: Multi Layer Perceptron testing is the testing done after modification of a system, component, or group of related units, to ensure that the modification is working correctly and is not damaging or imposing on other modules or producing unexpected results. It falls under the class of black box testing.
mlpc.py
# (CSV loading and feature mapping are identical to Mlpc1.py in Section 7.2)
data = df[["Website_Char1", "Website_Char2", "Website_Char3", "Website_Char4",
           "Website_Char5", "Url_Prop1", "Url_Prop2", "Url_Prop3", "Url_Prop4",
           "Url_Prop5", "Url_Prop6", "Url_Prop7", "Url_Prop8",
           "Phis_Probability"]].to_numpy()
inputs = data[:, :-1]
outputs = data[:, -1]
training_inputs = inputs[:1800]
training_outputs = outputs[:1800]
testing_inputs = inputs[1800:]
testing_outputs = outputs[1800:]
classifier = MLPClassifier()
classifier.fit(training_inputs, training_outputs)
predictions = classifier.predict(testing_inputs)
accuracy = 100.0 * accuracy_score(testing_outputs, predictions)
print("The accuracy of MLPC on testing data is: " + str(accuracy))
testSet = [[1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('MLPC Prediction on the first test set is:', predictions)
testSet = [[1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('MLPC Prediction on the second test set is:', predictions)
testSet = [[0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('MLPC Prediction on the third test set is:', predictions)
print("Note: 0 indicates Low probable Phishing Site, 1 indicates Medium "
      "probable Phishing Site, 2 indicates High probable Phishing Site")
8.3 TEST CASE
MLPC CLASSIFICATION
[0] - Low probable phishing site
[1] - Medium probable phishing site
[2] - High probable phishing site
CHAPTER-9
CONCLUSION
Phishing targets naive online users, tricking them into revealing sensitive information such as usernames, passwords, social security numbers or credit card numbers. Attackers fool Internet users by masking a webpage as a trustworthy or legitimate page to retrieve personal information. Many anti-phishing solutions, such as blacklists or whitelists and heuristic- and visual-similarity-based methods, have been proposed to date, but online users are still being trapped into revealing sensitive information on phishing websites.
This project, entitled "Detection of Phishing Websites using Deep Learning Algorithms", is useful in classifying the phishing websites data set using decision tree classification and Multi Layer Perceptron classification. The project is useful to newcomers to social media, protecting them against phishing URLs, and it finally leads to an improvement in the quality of the world wide web.
CHAPTER-10
FUTURE SCOPE
Phishing may never go out of season, but with the right approach one can minimise the risk that one's organisation will ever get hooked. This prototype has great potential to be further improved in the future. As of now, the project has been tested using a data set provided by Proofpoint containing data generated in the USA. The applicability of this project to similar data sets with data from other countries needs to be explored.