
DETECTING PHISHING WEBSITES

Mini Project II report submitted in partial fulfilment of the requirements for the
award of the degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
BY
POPURI TRIVENI (Reg No:19131A05H9)

METTU AMARNADH REDDY (Reg No:19131A05D2)

ROKALLA SRUTHI (Reg No:19131A05K6)

PANGA YOGESH (Reg No:19131A05G9)

Under the esteemed guidance of


Mrs. G. VANI
(Assistant Professor)
Department of Computer Science and Engineering
GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING (AUTONOMOUS)
(Affiliated to JNTU-K, Kakinada)
VISAKHAPATNAM
2022 – 2023

Gayatri Vidya Parishad College of Engineering (Autonomous)
Visakhapatnam

CERTIFICATE
This report on
“Detecting Phishing websites”
is a bonafide record of the mini project work submitted

By

POPURI TRIVENI (Reg No:19131A05H9)

METTU AMARNADH REDDY (Reg No:19131A05D2)

ROKALLA SRUTHI (Reg No:19131A05K6)

PANGA YOGESH (Reg No:19131A05G9)

in their VII semester in partial fulfilment of the requirements for the Award of Degree of
Bachelor of Technology in
Computer Science and Engineering
during the academic year 2022-2023

Mrs. G. VANI, M.Tech. (A.U.), (Ph.D.)
Assistant Professor
Project Guide

Dr. D.N.D. HARINI, M.Tech., Ph.D.
Head of the Department
Computer Science and Engineering

Declaration of the student
We, the undersigned, solemnly declare that this report is based on the project work carried out by us during the study (2019-23) in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology.

We further declare that the proofs submitted are genuine to the best of our knowledge.

Signatures of the Students:

Name of the Student Signature

P. Triveni

M. Amarnadh Reddy

R. Sruthi

P. Yogesh

ACKNOWLEDGEMENT

We would like to express our deep sense of gratitude to our esteemed institute Gayatri
Vidya Parishad College of Engineering (Autonomous), which has provided us an
opportunity to fulfil our cherished desire.

We express our profound gratitude and deep indebtedness to our guide Mrs. G. Vani, Assistant Professor, Department of Computer Science and Engineering, whose valuable suggestions, guidance and comprehensive assistance helped us a lot in realising our present project.

We express our deep sense of gratitude to Dr. D.N.D. HARINI, Associate Professor and Head of the Department of Computer Science and Engineering, Gayatri Vidya Parishad College of Engineering (Autonomous), for giving us an opportunity to do the project in college.

We express our sincere thanks to our Principal Dr. A.B. KOTESWARA RAO, Gayatri Vidya Parishad College of Engineering (Autonomous), for his encouragement during this project and for giving us a chance to explore and learn in the form of a mini project.

We also thank our coordinator, Dr. CH. SITA KUMARI, Associate Professor, Department of Computer Science and Engineering, for the kind suggestions and guidance for the successful completion of our project work.

Finally, we would also like to thank all the members of the teaching and non-teaching staff of
the Computer Science and Engineering Department for all their support in completion of our
project.

POPURI TRIVENI

METTU AMARNADH REDDY

ROKALLA SRUTHI

PANGA YOGESH
ABSTRACT

In recent years, as the use of mobile devices has increased, there has been an increasing tendency to move almost all real-world operations to the cyber world. This makes our daily lives easier, but due to the anonymous structure of the Internet, it also introduces many security gaps. Commonly used antivirus programs and firewall systems can prevent most attacks. However, experienced attackers target the weaknesses of computer users by attempting to phish them on fake websites. Scam websites appear to be genuine websites, carrying the logos and graphics of the original site. Phishing attacks are used to steal sensitive information such as user IDs, passwords, bank accounts and credit card numbers. These pages imitate popular banks, social media, e-commerce sites and more. This project aims to detect fraudulent or phishing websites using deep learning algorithms. The phishing problem is huge, and no single solution can effectively minimise all vulnerabilities, so multiple methods are used. One of the approaches uses shape-based analysis to authenticate the website. We use deep learning techniques and algorithms to evaluate different features of URLs and websites.

INDEX

CHAPTER 1. INTRODUCTION
1.1 Objective
1.2 About the Algorithm
1.3 Purpose
1.4 Scope

CHAPTER 2. SRS DOCUMENT
2.1 Functional Requirements
2.2 Non-functional Requirements
2.3 Minimum Hardware Requirements
2.4 Minimum Software Requirements

CHAPTER 3. ALGORITHM ANALYSIS
3.1 Existing Algorithm
3.2 Proposed Algorithm
3.3 Feasibility Study
3.4 Cost Benefit Analysis

CHAPTER 4. SOFTWARE DESCRIPTION
4.1 Anaconda Navigator
4.2 SK Learn
4.3 Pandas
4.4 Qt Designer
4.5 Matplotlib
4.6 Numpy
4.7 Streamlit

CHAPTER 5. PROJECT DESCRIPTION
5.1 Problem Definition
5.2 Project Overview
5.3 Module Description
5.3.1 Qt Designer
5.3.2 Model

CHAPTER 6. SYSTEM DESIGN
6.1 Introduction to UML
6.2 Building Blocks of the UML
6.3 UML Diagrams

CHAPTER 7. DEVELOPMENT
7.1 Datasets Used
7.2 Sample Code
7.3 Results

CHAPTER 8. TESTING
8.1 Introduction to Testing
8.2 Test Code
8.3 Test Cases

CHAPTER 9. CONCLUSION

CHAPTER 10. FUTURE SCOPE

CHAPTER 11. REFERENCE LINKS
CHAPTER-1
INTRODUCTION
Phishing is a fraudulent technique that uses social and technological tricks to steal customer identification and financial credentials. Social engineering schemes use spoofed emails from legitimate companies and agencies to lead users to fake websites that trick them into divulging financial details such as usernames and passwords. Hackers also install malicious software on computers to steal credentials, often using systems to intercept the usernames and passwords of consumers' online accounts. Phishers use multiple methods, including email, Uniform Resource Locators (URLs), instant messages, forum postings, telephone calls, and text messages, to steal user information. The structure of phishing content is similar to the original content and tricks users into accessing the content in order to obtain their sensitive data. The primary objective of phishing is to gain certain personal information for financial gain or identity theft. Phishing attacks are causing severe economic damage around the world. Moreover, most phishing attacks target financial/payment institutions and webmail, according to the Anti-Phishing Working Group's (APWG) latest phishing trend studies.

In order to obtain confidential data, criminals develop unauthorised replicas of a real website and its emails, typically from a financial institution or another organisation dealing with financial data. The email is rendered using a legitimate company's logos and slogans. The design and structure of HTML allow the copying of images or an entire website. This is also one of the factors in the rapid growth of the Internet as a communication medium, and it enables the misuse of brands, trademarks and other company identifiers that customers rely on as authentication mechanisms. To trap users, the phisher sends "spoofed" mails to as many people as possible. When these emails are opened, the customers tend to be diverted from the legitimate entity to a spoofed website.

1.1 PROJECT OBJECTIVE:
A phishing website (sometimes called a "spoofed" site or spam URL) tries to steal your account password or other confidential information by tricking you into believing you're on a legitimate website. You could even land on a phishing site by mistyping a URL (web address). The URLs of phishing websites usually differ from the URLs of ordinary websites in many aspects: the URL length of phishing websites is usually long, the URLs contain unusual symbols like '@', and they have a relatively larger number of dots (.). This project aims at classifying the phishing websites data set by using decision tree classification.
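These URL cues can be computed with a few lines of Python. The sketch below is purely illustrative; the url_features function and the example URL are assumptions for this report, not part of the project code:

from urllib.parse import urlparse

def url_features(url):
    """Simple cues used to flag suspicious URLs."""
    return {
        "length": len(url),                    # phishing URLs are usually long
        "has_at_symbol": "@" in url,           # '@' can hide the real host
        "dot_count": url.count("."),           # many dots suggest fake subdomains
        "uses_https": urlparse(url).scheme == "https",
    }

# Hypothetical example of a suspicious-looking URL.
print(url_features("http://paypal.com.secure-login.example.ru/@verify"))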

1.2 ALGORITHMS USED:


Multilayer perceptron classification & Decision tree algorithms.

1.2.1 ABOUT THE ALGORITHM:


Multilayer Perceptron:
The Multilayer Perceptron was developed to tackle the limitations of the single-layer perceptron. It is a neural network where the mapping between inputs and output is non-linear. A Multilayer Perceptron has input and output layers, and one or more hidden layers with many neurons stacked together. And while in the perceptron the neuron must have an activation function that imposes a threshold, like ReLU or sigmoid, neurons in a Multilayer Perceptron can use any arbitrary activation function.

Figure 1.2.1.1 Multilayer Perceptron

The Multilayer Perceptron falls under the category of feedforward algorithms, because inputs are combined with the initial weights in a weighted sum and subjected to the activation function, just like in the perceptron. But the difference is that each linear combination is propagated to the next layer: each layer feeds the next one with the result of its computation, its internal representation of the data. This goes all the way through the hidden layers to the output layer. But there is more to it. If the algorithm only computed the weighted sums in each neuron, propagated the results to the output layer, and stopped there, it wouldn't be able to learn the weights that minimise the cost function. If the algorithm only computed one iteration, there would be no actual learning.
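This forward propagation can be condensed into a few lines of NumPy. The sketch below is illustrative only, with placeholder layer sizes and random weights; a real classifier would add a softmax or sigmoid output and learn the weights by backpropagation:

import numpy as np

def relu(z):
    # Threshold activation applied after each weighted sum.
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    # Each layer combines its inputs with the weights in a weighted sum,
    # applies the activation, and propagates the result to the next layer.
    a = x
    for W, b in zip(weights, biases):
        a = relu(W @ a + b)
    return a

rng = np.random.default_rng(0)
# Placeholder sizes: 13 input features, one hidden layer of 8 neurons,
# 3 output classes (matching the low/medium/high labels used later).
weights = [rng.normal(size=(8, 13)), rng.normal(size=(3, 8))]
biases = [np.zeros(8), np.zeros(3)]
print(forward(rng.normal(size=13), weights, biases))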

1.3 VISION/PURPOSE:
Phishing is a form of fraud in which the attacker tries to learn sensitive information such as login credentials or account information by masquerading as a reputable entity or person in email or other communication channels. The purpose of phishing domain detection is to detect phishing domain names. Therefore, passive queries related to the domain name that we want to classify as phishing or not provide useful information to us. The project aims at developing a tool for the detection of phishing websites using deep learning algorithms, with all the above-mentioned advantages.

MISSION:

This tool is developed using Python along with its layout toolkit PyQt, PyUIC and Python's sklearn.

1.4 PROJECT SCOPE:


The project comprises four modules. The first module deals with storing the URL properties and site characteristics of phishing websites into the database. The second module deals with the creation of a sub-dataset plot. The third module deals with the development of Python routines to execute multilayer perceptron classification on the phishing-sites data set, and the final module deals with the decision tree analysis of the phishing-sites data.

The project uses the 'sklearn' module of Python to perform the multilayer perceptron classification and decision tree analysis. The Multi Layer Perceptron Classifier relies on an underlying neural network to perform the task of classification. A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. The Multilayer Perceptron Classifier is characterised by its input-layer features, number of hidden layers, training-set classes, number of training iterations, learning rate and activation function. A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches. Decision trees can handle both categorical and numerical data.
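As an illustration of these characteristics, the hedged sketch below constructs both classifiers with explicit hyperparameters; the particular values are placeholders, not the project's tuned settings:

from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder hyperparameters illustrating the MLP characteristics listed
# above: hidden layers, training iterations, learning rate, activation.
mlp = MLPClassifier(hidden_layer_sizes=(64, 64),  # two hidden layers of 64 neurons
                    activation="relu",            # activation function
                    learning_rate_init=0.001,     # learning rate
                    max_iter=300,                 # number of training iterations
                    random_state=0)

# A decision tree needs no network structure; depth controls the tree size.
dt = DecisionTreeClassifier(max_depth=5, random_state=0)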

CHAPTER-2

SRS DOCUMENT

A software requirements specification (SRS) is a document that describes what the software will do and how it will be expected to perform.

2.1 FUNCTIONAL REQUIREMENTS:


A Functional Requirement (FR) is a description of the service that the software must offer. It describes a software system or its component. A function is nothing but inputs to the software system, its behaviour, and outputs. It can be a calculation, data manipulation, business process, user interaction, or any other specific functionality which defines what function a system is likely to perform. Functional Requirements are also called Functional Specifications.

➢ The project should be useful in classifying the phishing websites data set using decision tree classification and multilayer perceptron classification.
➢ The project should be useful to newcomers to social media, protecting them against phishing URLs.
➢ This project should finally lead to an improvement in the quality of the World Wide Web.

2.2 NON-FUNCTIONAL REQUIREMENTS:


Non-Functional Requirements (NFR) specify the quality attributes of a software system. They judge the software system based on responsiveness, usability, security and portability. Non-functional requirements are also called the qualities of a system.

➢ To use the latest decision tree and multilayer perceptron modules of the sklearn tool.
➢ To use Python, which is rated highly by the programming community; more functionality can be implemented with fewer lines of code in Python.
➢ To use the PyQt tool to create the graphical user interfaces.
➢ All the front-end code is to be generated automatically by PyUIC.

2.3 MINIMUM HARDWARE REQUIREMENTS


1. It requires a minimum of 2.16 GHz processor.

2. It requires a minimum of 4 GB RAM.

3. It requires 64-bit architecture.

4. It requires a minimum storage of 500GB.

2.4 MINIMUM SOFTWARE REQUIREMENTS


1. It requires a 64-bit Windows operating system.

2. Python Qt Designer for designing user interface.

3. SQLite3 for storing database Entities.

4. Pyuic for converting the layout designed user interface (UI) to python code.

5. Python language for coding.

CHAPTER-3

ANALYSIS

3.1 EXISTING SYSTEMS


Deep learning algorithms:

To evaluate the performance of the feature set, it has been trained and cross-validated against many different parameter combinations. In the multilayer feed-forward network, we must gather data based on the feature sets and then tune the parameters to achieve maximum accuracy in phishing-site classification.
It is an essential process: the training networks must set parameters and validate them across appropriate values. After attaining the right values, phishing sites can easily be classified with the highest probability.
We used the Python programming language along with the TensorFlow library to implement the deep learning algorithms.

Formal description of DNN:


The DNN is a type of machine learning technology. It consists of many common neural
network layers. It has one input layer, one output layer and at least one hidden layer.

Fig 3.1.1 DNN

Each layer is composed of the basic computing unit, i.e., the neuron. The neuron is inspired by the biological neuron and performs mathematical functions for the storage of information. This information is transmitted to other neurons, and therefore information propagates through the neural network. A neuron's general mathematical representation is

Y_k = U(W_k · X + B_k)

where U is the activation function, W_k ∈ R^(L×B) is the weight of the k-th neuron, B_k is its bias, and Y_k is the output of the k-th neuron.

The number of neurons in the input layer depends upon the dimension of the dataset, or equivalently the number of features of the dataset, i.e., X ∈ R^(L×K), where L is the total number of records in the dataset, K is the total number of features in the dataset, and R represents a real number.

Let l ∈ {0, 1, 2, 3, 4, 5, 6} index the layers in our deep learning model, let Y^(l-1) be the input to layer l, Y^(l) the output value of layer l, W^(l) the weight of layer l that is used for the linear transformation of inputs from n neurons to outputs of m neurons, B^(l) the bias of layer l, and F^(l) the associated activation function of each layer. Y^(0) is the input layer and Y^(6) is the output layer, so that

Y^(l) = F^(l)(W^(l) * Y^(l-1) + B^(l))

where * denotes matrix multiplication. The W values were initialised with Xavier initialisation, and B was initialised with zero. W and B are updated after each iteration of the backpropagation method. Layer 0 is the input layer, layer 6 is the output layer, and layers 1-5 are hidden layers activated with the ReLU function, given by

Y_i^(l) = max(0, W_i^(l) * Y_i^(l-1) + B_i^(l))

where i represents the i-th iteration and l represents the l-th layer. The intermediate output of our model, Y*, is obtained using the sigmoid activation function:

Y* = 1 / (1 + e^(-(W^(l) * Y^(l-1) + B^(l))))

where l = 6 in the case of the output layer. The loss function l(Y*, Y) over the entire dataset is defined as the sum of the cross-entropy between the model output and the actual output:

l(Y*, Y) = -Σ_j [ Y_j log(Y*_j) + (1 - Y_j) log(1 - Y*_j) ]

where Y* is the intermediate output of the entire dataset obtained after processing it through the deep learning model and Y*_j ∈ (0, 1) is the j-th row of Y*, while Y is the actual label of our dataset and Y_j ∈ {0, 1} is the j-th row of Y, where 0 represents a legitimate site and 1 indicates a phishing site.

The loss function given earlier is optimised using the Adam optimiser at every epoch to update the parameters and train the deep neural model using the backpropagation algorithm. The functional formations represent these features without overfitting, because the DNN has 5 hidden layers along with 1 input and 1 output layer.
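As a hedged sketch (not the exact code of this system), the architecture described above, with 5 ReLU hidden layers, a sigmoid output, Xavier initialisation, cross-entropy loss and the Adam optimiser, can be expressed in TensorFlow/Keras as follows; the input dimension and layer widths are assumed values for illustration:

import tensorflow as tf

n_features = 30   # assumed input dimension; use the real feature count
model = tf.keras.Sequential(
    [tf.keras.Input(shape=(n_features,))]
    # Five hidden layers with ReLU activation and Xavier (Glorot) initialisation.
    + [tf.keras.layers.Dense(64, activation="relu",
                             kernel_initializer="glorot_normal")
       for _ in range(5)]
    # Sigmoid output: 0 = legitimate site, 1 = phishing site.
    + [tf.keras.layers.Dense(1, activation="sigmoid")]
)
model.compile(optimizer="adam",             # Adam optimiser
              loss="binary_crossentropy",   # cross-entropy loss
              metrics=["accuracy"])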

3.2 PROPOSED ALGORITHM:

Multilayer Perceptron Algorithm:

❖ The MLP is a special case of a feedforward neural network in which every layer is a fully connected layer. The term MLP is used loosely to mean any feedforward ANN, and more strictly to mean a network composed of multiple layers of perceptrons.
❖ The MLP consists of sequential layers of function compositions: the raw data enter at the input layer, which then generates the input for the next layer.
❖ The output of the hidden layer is the input for the output layer, which applies the final function. Each layer consists of a set of nodes or 'neurons'; a node receives the input from the previous layer and applies an activation function: the identity function for linear regression, and the logistic (sigmoid) function for logistic regression.
❖ By expanding the MLP network in depth and width, the flexibility of the function increases. In our experiments, we try increasing the depth and width to find the network structure that best improves the proposed model's performance, as sketched below.
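A minimal sketch of such a depth/width search with sklearn's MLPClassifier follows; the candidate structures and the synthetic stand-in data are assumptions for illustration, not the project's actual experiment:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the encoded phishing data set (13 features).
X, y = make_classification(n_samples=2000, n_features=13, random_state=0)

# Candidate structures: wider and deeper networks.
param_grid = {"hidden_layer_sizes": [(32,), (64,), (32, 32), (64, 64, 64)]}

search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))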

3.3 FEASIBILITY STUDY


A feasibility study is an analysis that takes all of a project's relevant factors into account, including economic, technical, legal, and scheduling considerations, to ascertain the likelihood of completing the project successfully. A feasibility study is important and essential to decide whether any proposed project is feasible or not. A feasibility study is simply an assessment of the practicality of a proposed plan or project.

The main objectives of feasibility are mentioned below:

To determine if the product is technically and financially feasible to develop, is the main aim
of the feasibility study activity. A feasibility study should provide management with enough
information to decide:

o Whether the project can be done.
o How successful the proposed action will be.
o Whether the final product will benefit its intended users.
o The nature and complexity of the project.
o What the alternatives are among which a solution will be chosen (during subsequent phases).
o Whether the software meets organisational requirements.

There are various types of feasibility that can be determined. They are

Operational - Defines the urgency of the problem and the acceptability of any solution. It includes people-oriented and social issues: internal issues, such as manpower problems, labour objections, manager resistance, organisational conflicts, and policies; and external issues, including social acceptability, legal aspects, and government regulations.

Technical - Is the project within the limits of current technology? Does the technology exist at all? Is it available within the given resources?

Economic - Is the project possible, given resource constraints? Are the benefits that will
accrue from the new system worth the costs? What are the savings that will result from the
system, including tangible and intangible ones? What are the development and operational
costs?

Schedule - Constraints on the project schedule and whether they could be reasonably met.

3.3.1 Economic Feasibility: Economic analysis could also be referred to as cost/benefit analysis. It is the most frequently used method for evaluating the effectiveness of a new system. In economic analysis, the procedure is to determine the benefits and savings that are expected from a candidate system and compare them with the costs. The economic feasibility study relates to the price and all kinds of expenditure on the scheme before the project starts. This study also improves project reliability.

It is also helpful for the decision-makers in deciding whether the planned scheme should be processed later or now, depending on the financial condition of the organisation. This evaluation process also studies the price benefits of the proposed scheme. Economic feasibility also covers the following tasks:

o Cost of packaged software / software development.
o Cost of doing a full system study.
o Is the system cost-effective?

3.3.2 Technical Feasibility: A large part of determining resources has to do with assessing
technical feasibility. It considers the technical requirements of the proposed project. The
technical requirements are then compared to the technical capability of the organisation. The
systems project is considered technically feasible if the internal technical capability is
sufficient to support the project requirements.

The analyst must find out whether current technical resources can be upgraded or added to in a manner that fulfils the request under consideration; this is where the expertise of system analysts is beneficial, since using their own experience and their contact with vendors they will be able to answer the question of technical feasibility. Technical feasibility also covers the following tasks:

o Is the technology available within the given resource constraints?
o Does the technology have the capacity to handle the solution?
o Is the relevant technology stable and established?
o Does the technology chosen for software development have a large number of users, so that they can be consulted when problems arise or improvements are required?

3.3.3 Operational Feasibility: Operational feasibility is a measure of how well a proposed system solves the problems and takes advantage of the opportunities identified during scope definition, and how it satisfies the requirements identified in the requirements analysis phase of system development.

Operational feasibility refers to the availability of the operational resources needed to extend research results beyond the setting in which they were developed, and for which all the operational requirements are minimal and easily accommodated.

In addition, operational feasibility includes any rational compromises users make in adjusting the technology to the limited operational resources available to them. Operational feasibility also covers tasks like:

o Does the current mode of operation provide adequate response time?
o Does the current mode of operation make maximum use of resources?
o Is the solution suggested by the software development team acceptable?
o Does the operation offer an effective way to control the data?

Our project operates with a standard processor, and the packages installed are supported by the system.

3.4 COST BENEFIT ANALYSIS:


The financial and the economic questions during the preliminary investigation are verified to
estimate the following:

● The cost of the hardware and software for the class of application
being considered.
● The benefits in the form of reduced cost.
● The proposed system will, as a result, give detailed information.
● Performance is improved, which in turn may be expected to provide increased profits.
● This feasibility checks whether the system can be developed with the
available funds.
● This can be done economically if planned judicially, so it is
economically feasible.

CHAPTER-4

SOFTWARE DESCRIPTION

4.1 Anaconda-Navigator:
Anaconda is an open-source and free distribution of the R and Python programming languages for machine learning and data science projects; it is therefore known as a professional data science platform. It contains a powerful environment manager, which provides different types of Python environments, such as Spyder, Jupyter Notebook, and so on. Navigator can search for packages on Anaconda Cloud or in a local Anaconda Repository. It is available for Windows, macOS, and Linux.

➢ Anaconda is a free and open-source distribution.
➢ Quickly download 7,500+ Python/R data science packages.
➢ Manages a lot of Python libraries.
➢ It provides various environments by virtualisation.
➢ It can easily deal with large data computing.
➢ It works properly without the need for any administrative privileges.
➢ Visualise results with Matplotlib, Bokeh, Datashader, and Holoviews.

4.2 Sk Learn:
Scikit-Learn is a free machine learning library for Python. It supports both supervised and
unsupervised machine learning, providing diverse algorithms for classification, regression,
clustering, and dimensionality reduction.
The library is built using many libraries you may already be familiar with, such as NumPy
and SciPy. It also plays well with other libraries, such as Pandas and Seaborn.
The library also provides access to many different datasets, one of which is the famous iris dataset. The dataset is so famous that it's often referred to as the "hello world" of machine learning.
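For example, that dataset can be loaded with a single call (a quick illustration, independent of this project):

from sklearn.datasets import load_iris

# X: 150 flower samples with 4 measurements each; y: species labels 0-2.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)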

4.3 Pandas:
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and
was created by Wes McKinney in 2008.

Pandas is a Python library used for working with data sets; it is used to analyse data. It has functions for analysing, cleaning, exploring, and manipulating data. Pandas allows us to analyse big data and draw conclusions based on statistical theories. Pandas can clean messy data sets and make them readable and relevant. Relevant data is very important in data science.

4.4 Qt Designer:
Qt Designer is a Qt tool that provides you with a what-you-see-is-what-you-get
(WYSIWYG) user interface to create GUIs for your PyQt applications productively and
efficiently. The PyQt installer comes with a GUI builder tool called Qt Designer.
Using its simple drag and drop interface, a GUI interface can be quickly built without having
to write the code.
It is, however, not an IDE such as Visual Studio. Hence, Qt Designer does not have the
facility to debug and build the application.

4.5 Matplotlib:
Matplotlib is a plotting library for Python. Matplotlib is an amazing visualisation library in Python for 2D plots of arrays. It is a multi-platform data visualisation library built on NumPy arrays and designed to work with the broader SciPy stack.
Matplotlib comes with a wide variety of plots. Plots help one to understand trends and patterns, and to make correlations. It offers various plots such as line, bar and scatter plots. pyplot is the most important interface in the matplotlib library and is used to plot 2D data.

4.6 Numpy:
● NumPy stands for Numerical Python. NumPy is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.

● It is optimised to work with the latest CPU architectures. NumPy arrays are
stored at one continuous place in memory unlike lists, so processes can access and
manipulate them very efficiently.

4.7 Streamlit:

● Streamlit is an open-source Python-based framework for developing and deploying interactive data science dashboards and machine learning models.
● It is built on top of Python and supports many of the mainstream Python libraries, such as matplotlib, plotly and pandas.
● Streamlit makes it easy to visualise, mutate, and share data.

CHAPTER-5

PROJECT DESCRIPTION

5.1 PROBLEM DEFINITION


Vision/Purpose:

The project aims at developing a tool for Detection of Phishing Websites using Deep
Learning Algorithms

Mission:

This tool is developed using Python along with its layout toolkit PyQt, PyUIC and Python's sklearn. We use a multilayer perceptron to detect phishing websites.

5.2 PROJECT OVERVIEW


The main aim of our project is the detection of phishing websites using deep learning algorithms. Multilayer perceptron testing is testing done after the modification of a system, component, or group of related units, to ensure that the modification works correctly and is not damaging or forcing other modules to produce unexpected results. It falls under the class of black-box testing.
The steps involved in our project are:

1. Collection of datasets.

2. Data split for training and testing.

3. Training the model.

4. Repeating steps 2 and 3 for different ratios of training and testing data to maximise accuracy (see the sketch below).
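A minimal sketch of steps 2-4, using a synthetic stand-in for the project's data set (an assumption for illustration; the actual code in Chapter 7 uses a fixed 1800-row split):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Stand-in for the encoded phishing data (13 features, as in Chapter 7).
X, y = make_classification(n_samples=2000, n_features=13, random_state=0)

best_acc, best_ratio = 0.0, None
for test_size in (0.1, 0.2, 0.3):            # step 4: different split ratios
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0)  # step 2: data split
    model = MLPClassifier(max_iter=500, random_state=0)
    model.fit(X_tr, y_tr)                    # step 3: training the model
    acc = accuracy_score(y_te, model.predict(X_te))
    if acc > best_acc:
        best_acc, best_ratio = acc, test_size
print("best accuracy %.3f with test_size %.1f" % (best_acc, best_ratio))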

5.3 MODULE DESCRIPTION


5.3.1 Qt Designer:

Qt Designer is the tool used to create the graphical user interfaces in this project. The tool can be found in the Library/Bin folder of the Anaconda tool, as shown in the following screenshot.

Fig 5.3.1.1 Qt designer window

● As this designer tool is used again and again in the project, it’s better to create a
shortcut on the desktop, by using a right-click on designer.exe, as shown in the
following figure.

Fig 5.3.1.2 Qt designer Window

5.3.2 MODEL

Deep learning
Deep learning, also called deep structured learning, is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, unsupervised or semi-supervised. Deep learning is an increasingly popular subset of machine learning. Deep learning models are built using neural networks. A neural network takes in inputs, which are then processed in hidden layers using weights that are adjusted during training. Then the model spits out a prediction. The weights are adjusted to find patterns in order to make better predictions. In deep learning, a computer model learns to perform classification tasks directly from images, text, or sound. Deep learning models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance. Models are trained by using a large set of labelled data. Deep learning architectures such as deep neural networks have been applied to fields including computer vision, speech recognition, NLP, audio recognition and other fields.

Types of Deep Learning:

1. Feed Forward Neural Network.


2. Recurrent Neural Network.
3. Convolutional Neural Network.
4. Restricted Boltzmann Machine
5. Autoencoders

Why Deep Learning?


Deep learning has attracted a lot of attention because it is particularly good at a type of learning that has the potential to be very useful for real-world applications. Deep learning networks can be successfully applied to big data for knowledge discovery, knowledge application, and knowledge-based prediction. In other words, deep learning can be a powerful engine for producing actionable results. Deep neural networks have recently been applied successfully in many diverse domains; as an example, end-to-end learning networks provide a mapping between an input, such as an image of a diseased plant, and an output, such as a crop-disease pair. Deep learning is useful for image recognition and provides better solutions for obtaining optimised images.
Multilayer perceptron
● The multilayer perceptron (MLP) is a supplement of the feedforward neural network.
● It consists of three types of layers: the input layer, the output layer and the hidden layers, as shown in Fig. 5.3.2.1. The input layer receives the input signal to be processed.
● The required task, such as prediction or classification, is performed by the output layer. An arbitrary number of hidden layers placed between the input and output layers are the true computational engine of the MLP.
● As in a feedforward network, in an MLP the data flow in the forward direction from the input to the output layer. The neurons in the MLP are trained with the backpropagation learning algorithm.
● MLPs are designed to approximate any continuous function and can solve problems which are not linearly separable. The major use cases of the MLP are pattern classification, recognition, prediction and approximation.

Fig 5.3.2.1 multilayer perceptron

CHAPTER-6
SYSTEM DESIGN
6.1 Introduction to UML
Unified Modelling Language (UML) is a general-purpose modelling language. The
main aim of UML is to define a standard way to visualise the way a system has been
designed. It is quite like blueprints used in other fields of engineering. UML is not a
programming language, it is rather a visual language. We use UML diagrams to
portray the behaviour and structure of a system. UML helps software engineers,
businessmen and system architects with modelling, design and analysis. The Object
Management Group (OMG) adopted Unified Modelling Language as a standard in
1997. It's been managed by OMG ever since. The International Organisation for
Standardisation (ISO) published UML as an approved standard in 2005. UML has been
revised over the years and is reviewed periodically.

Why we need UML


● Complex applications need collaboration and planning from multiple teams
and hence require a clear and concise way to communicate amongst them.
● Businessmen do not understand code, so UML becomes essential for communicating the system's essential requirements, functionalities and processes to non-programmers.
● A lot of time is saved down the line when teams can visualise processes,
user interactions and static structure of the system.
● UML is linked with object-oriented design and analysis. UML makes the
use of elements and forms associations between them to form diagrams. Diagrams
in UML can be broadly classified as:
○ Structural Diagrams – Capture static aspects or structure of a system.
Structural Diagrams include Component Diagrams, Object Diagrams,
Class Diagrams and Deployment Diagrams.
○ Behaviour Diagrams – Capture dynamic aspects or behaviour of the system.
Behaviour diagrams include Use Case Diagrams, State
Diagrams, Activity Diagrams and Interaction Diagrams.

Fig 6.1.1 flow chart

6.2 Building Blocks of the UML


● The vocabulary of the UML encompasses three kinds of building blocks:
1. Things
2. Relationships
3. Diagrams
● Things are the abstractions that are first-class citizens in a model; relationships
tie these things together; diagrams group interesting collections of things.

Things in the UML


● There are four kinds of things in the UML:
➢ Structural things
➢ Behavioural things
➢ Grouping things
➢ Annotational things

● These things are the basic object-oriented building blocks of the UML.
You use them to write well-formed models.

Structural Things
Structural things are the nouns of UML models. These are the mostly static parts of a model,
representing elements that are either conceptual or physical. Collectively, the structural things
are called classifiers.
A class is a description of a set of objects that share the same attributes, operations,
relationships, and semantics. A class implements one or more interfaces. Graphically, a
class is rendered as a rectangle, usually including its name, attributes, and operations

● Class - A Class is a set of identical things that outlines the functionality and
properties of an object. It also represents the abstract class whose
functionalities are not defined. Its notation is as follows

● Interface - A collection of functions that specify a service of a class or


component,
i.e. Externally visible behaviour of that class.

● Collaboration - A larger pattern of behaviours and actions.


Example: All classes and behaviours that create the modelling of a moving tank
in a simulation.

● Use Case - A sequence of actions that a system performs that yields an


observable result. Used to structure behaviour in a model. Is realised by collaboration.

● Component - A physical and replaceable part of a system that implements a


number of interfaces. Example: a set of classes, interfaces, and collaborations.

● Node - A physical element existing at run time and represents a source.

Behavioural Things
● Behavioural things are the dynamic parts of UML models. These are
the verbs of a model, representing behaviour over time and space. In all, there
are three primary kinds of behavioural things
1. Interaction
2. State machine

1) Interaction
● It is a behaviour that comprises a set of messages exchanged among a
set of objects or roles within a particular context to accomplish a specific
purpose.
● The behaviour of a society of objects or of an individual operation may
be specified with an interaction.
● An interaction involves a number of other elements, including
messages, actions, and connectors (the connection between objects).
● Graphically, a message is rendered as a directed line, almost always
including the name of its operation.

2) State machine

State machine is a behaviour that specifies the sequences of states an object or an


interaction goes through during its lifetime in response to events, together with its
responses to those events. The behaviour of an individual class or a collaboration of
classes may be specified with a state machine. A state machine involves a number of
other elements, including states, transitions (the flow from state to state), events
(things that trigger a transition), and activities (the response to a transition).
Graphically, a state is rendered as a rounded rectangle, usually including its name and

its substates.

Grouping Things

● Grouping things can be defined as a mechanism to group elements of a


UML model together. There is only one grouping thing available.
● Package − Package is the only one grouping thing available for gathering
structural and behavioural things.

Annotational Things

● Annotational things are the explanatory parts of UML models. These are
the comments you may apply to describe, illuminate, and remark about any
element in a model.
● There is one primary kind of annotational thing, called a note. A note is
simply a symbol for rendering constraints and comments attached to an element or

a collection of elements.

Relationships in the UML


● Relationship is another most important building block of UML. It shows
how the elements are associated with each other and this association describes the
functionality of an application.
● There are four kinds of relationships in the UML:
○ Dependency
○ Association
○ Generalisation
○ Realisation
1. Dependency
● It is an element (the independent one) that may affect the semantics of
the other element (the dependent one).
● Graphically, a dependency is rendered as a dashed line, possibly
directed, and occasionally including a label.

2. Association
● Association is basically a set of links that connects the elements of a
UML model.
● It also describes how many objects are taking part in that relationship.

3. Generalisation
● It is a specialization/generalisation relationship in which the
specialized element (the child) builds on the specification of the generalised
element (the parent).
● The child shares the structure and the behaviour of the parent.
Graphically, a generalisation relationship is rendered as a solid line with
a hollow arrowhead pointing to the parent

4. Realisation
● Realisation can be defined as a relationship in which two elements are
connected.
● One element describes some responsibility, which is not implemented
and the other one implements them.
● This relationship exists in the case of interfaces.

6.3 UML DIAGRAMS


● UML is a modern approach to modelling and documenting software. It is based on diagrammatic representations of software components.
● It is the final output, and the diagram represents the system.
● UML includes the following:
➢ Class diagram
➢ Object diagram
➢ Component diagram
➢ Composite structure diagram
➢ Use case diagram
➢ Sequence diagram
➢ Communication diagram
➢ State diagram
➢ Activity diagram

Fig 6.3.1: use case diagram

Description of Use case diagram


● Use case diagrams are usually referred to as behaviour diagrams used to
describe a set of actions (use cases) that some system or systems (subject) should or
can perform in collaboration with one or more external users of the system (actors).
● A use case diagram at its simplest is a representation of a user's interaction
with the system that shows the relationship between the user and the different use
cases in which the user is involved.
● As we can see, the user interacts with the system by providing URL features and website characteristics as input, followed by the calculation of the accuracies of multilayer perceptron classification and decision tree analysis.

Fig 6.3.2 : sequence diagram

Description of sequence diagram

● A sequence diagram is an interaction diagram that shows how objects operate


with one another and in what order.
● It is a construct of a message sequence chart. A sequence diagram shows
object interactions arranged in time sequence.
● From the above sequence diagram, we proceed in order: enter the needed details as shown in the figure, provide the URL features and website characteristics, and then generate the outputs, viz. the accuracies along with the predictions of the decision tree and multilayer perceptron classifications.

ACTIVITY DIAGRAM

Fig 6.3.3: Activity diagram

Description of Activity diagram

Activity diagram is another important diagram in UML to describe dynamic aspects of the
system. Activity diagram is basically a flowchart to represent the flow from one activity to
another activity. The activity can be described as an operation of the system. So, the control
flow is drawn from one operation to another. In the activity diagram we can see that we first provide the dataset with URL features and website characteristics, then split the dataset into training and test sets, and then calculate the multilayer perceptron classification accuracy and the decision tree analysis accuracy, along with the predictions.

CHAPTER-7

DEVELOPMENT

7.1 DATASET USED

● A DATASET is a set or collection of data. This set is normally presented in a tabular pattern. Every column describes a particular variable. Each row corresponds to a given member of the data set, as per the given question. This is a part of DATA MANAGEMENT.
● Some types of data sets are: numerical, bivariate, multivariate, and categorical.
● Our dataset consists of various data related to website characteristics and URL properties.
● Website characteristics include whether the website contains any mysterious or non-mysterious links, working or non-working links, the age of the website (i.e. whether it has existed for less than or more than one year), whether the favicon is loaded from the same domain or a different domain, and whether the website slows down the performance of the browser.
● URL properties include URL length, slash position, IP address as part of the URL, shortened or non-shortened URL, whether the '@' or '-' symbols exist, whether three dots are present in the URL, and whether the URL uses HTTP or HTTPS.
● The images of our dataset are:

7.2 SAMPLE CODE

Phisdet1.py

import sys
import os

from phisdet import *
from PyQt5 import QtWidgets, QtGui, QtCore


class MyForm(QtWidgets.QMainWindow):
    def __init__(self, parent=None):
        QtWidgets.QWidget.__init__(self, parent)
        self.ui = Ui_MainWindow()
        self.ui.setupUi(self)
        # Wire each button of the main window to the script it launches.
        self.ui.pushButton.clicked.connect(self.wschar)
        self.ui.pushButton_3.clicked.connect(self.mlpc)
        self.ui.pushButton_4.clicked.connect(self.dtree)
        self.ui.pushButton_5.clicked.connect(self.urlprp)
        self.ui.pushButton_6.clicked.connect(self.subplt1)

    def wschar(self):
        os.system("python sitechar1.py")

    def mlpc(self):
        os.system("python -W ignore mlpc1.py")

    def dtree(self):
        os.system("python -W ignore dt1.py")

    def urlprp(self):
        os.system("python urlprop1.py")

    def subplt1(self):
        os.system("python datasubset1.py")


if __name__ == "__main__":
    app = QtWidgets.QApplication(sys.argv)
    myapp = MyForm()
    myapp.show()
    sys.exit(app.exec_())

Urlprop1.py
import sys
import os
import sqlite3

from urlprop import *
from PyQt5 import QtWidgets, QtGui, QtCore

con = sqlite3.connect('phisdet1')


class MyForm(QtWidgets.QMainWindow):
    def __init__(self, parent=None):
        QtWidgets.QWidget.__init__(self, parent)
        self.ui = Ui_MainWindow()
        self.ui.setupUi(self)
        self.ui.pushButton.clicked.connect(self.insertvalues)
        # self.ui.pushButton_2.clicked.connect(self.testdetails)

    def insertvalues(self):
        # Read the eight URL properties from the form and insert them into the DB.
        with con:
            cur = con.cursor()
            uid = str(self.ui.lineEdit_9.text())
            s1 = str(self.ui.lineEdit_3.text())
            s2 = str(self.ui.lineEdit_4.text())
            s3 = str(self.ui.lineEdit_5.text())
            s4 = str(self.ui.lineEdit_6.text())
            s5 = str(self.ui.lineEdit_2.text())
            s6 = str(self.ui.lineEdit_7.text())
            s7 = str(self.ui.lineEdit_8.text())
            s8 = str(self.ui.lineEdit_10.text())
            cur.execute('INSERT INTO urlprops(uid,s1,s2,s3,s4,s5,s6,s7,s8) '
                        'VALUES(?,?,?,?,?,?,?,?,?)',
                        (uid, s1, s2, s3, s4, s5, s6, s7, s8))
            con.commit()


if __name__ == "__main__":
    app = QtWidgets.QApplication(sys.argv)
    myapp = MyForm()
    myapp.show()
    sys.exit(app.exec_())

Sitechar1.py
# This program stores the website-characteristic values entered in the form into the DB.
import sys
import os
import sqlite3

from sitechar import *
from PyQt5 import QtWidgets, QtGui, QtCore

con = sqlite3.connect('phisdet1')


class MyForm(QtWidgets.QMainWindow):
    def __init__(self, parent=None):
        QtWidgets.QWidget.__init__(self, parent)
        self.ui = Ui_MainWindow()
        self.ui.setupUi(self)
        self.ui.pushButton.clicked.connect(self.insertvalues)
        # self.ui.pushButton_2.clicked.connect(self.testdetails)

    def insertvalues(self):
        # Read the five website characteristics from the form and insert them into the DB.
        with con:
            cur = con.cursor()
            uid = str(self.ui.lineEdit_9.text())
            s1 = str(self.ui.lineEdit_4.text())
            s2 = str(self.ui.lineEdit_5.text())
            s3 = str(self.ui.lineEdit_6.text())
            s4 = str(self.ui.lineEdit_7.text())
            s5 = str(self.ui.lineEdit_8.text())
            cur.execute('INSERT INTO sitechars(uid,s1,s2,s3,s4,s5) '
                        'VALUES(?,?,?,?,?,?)', (uid, s1, s2, s3, s4, s5))
            con.commit()


if __name__ == "__main__":
    app = QtWidgets.QApplication(sys.argv)
    myapp = MyForm()
    myapp.show()
    sys.exit(app.exec_())

Datasubset1.py

# This plot is generated by considering a subset of 500 rows, and the first and
# last columns from the dataset.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

np.random.seed(6)

df = pd.read_csv('phisset.csv')
df["Website_Char1"] = df["Website_Char1"].map(
    {'MysteriousLinks': 1, 'NoMysteriousLinks': 0})
df["Phis_Probability"] = df["Phis_Probability"].map(
    {'Medium': 1, 'Low': 0, 'High': 2})

cnt1 = (df["Website_Char1"] == 1).sum()     # number of '1's in the first column
cnt2 = (df["Website_Char1"] == 0).sum()     # number of '0's in the first column
cnt3 = (df["Phis_Probability"] == 0).sum()  # number of '0's in the last column
cnt4 = (df["Phis_Probability"] == 1).sum()  # number of '1's in the last column
cnt5 = (df["Phis_Probability"] == 2).sum()  # number of '2's in the last column

x = ['NonMyst', 'Myst.lnks.', 'Low', 'Medium', 'High']
y = [cnt1, cnt2, cnt3, cnt4, cnt5]

fig, ax = plt.subplots()
width = 0.75             # the width of the bars
ind = np.arange(len(y))  # the y locations for the groups
ax.barh(ind, y, width, color="blue")
ax.set_yticks(ind + width / 2)
ax.set_yticklabels(x, minor=False)
for i, v in enumerate(y):
    ax.text(v + 3, i + .25, str(v), color='blue', fontweight='bold')
plt.title('Mysterious Links vs. Phis Probability')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
# plt.savefig('test.png', dpi=300, format='png', bbox_inches='tight')
# use format='svg' or 'pdf' for vector pictures

Mlpc1.py

from sklearn.neural_network import MLPClassifier
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

df = pd.read_csv('phisset.csv')

# Encode the categorical website characteristics and URL properties as integers.
df["Website_Char1"] = df["Website_Char1"].map({'MysteriousLinks': 1, 'NoMysteriousLinks': 0})
df["Website_Char2"] = df["Website_Char2"].map({'NonWorkingLinks': 1, 'WorkingLinks': 0})
df["Website_Char3"] = df["Website_Char3"].map({'LessThanOneYear': 1, 'MoreThanOneYear': 0})
df["Website_Char4"] = df["Website_Char4"].map({'DiffDomainFavicon': 1, 'SameDomainFavicon': 0})
df["Website_Char5"] = df["Website_Char5"].map({'SlowDownPerf': 1, 'NoSlowDownPerf': 0})
df["Url_Prop1"] = df["Url_Prop1"].map({'LessThan54': -1, '54to75': 0, 'MoreThan75': 1})
df["Url_Prop2"] = df["Url_Prop2"].map({'DifferentPositionedSlash': 1, 'CorrectPositionedSlash': 0})
df["Url_Prop3"] = df["Url_Prop3"].map({'IPAddressPart': 1, 'NoIPAddressPart': 0})
df["Url_Prop4"] = df["Url_Prop4"].map({'ShortendURL': 1, 'NoShortendURL': 0})
df["Url_Prop5"] = df["Url_Prop5"].map({'@symbol': 1, 'No@symbol': 0})
df["Url_Prop6"] = df["Url_Prop6"].map({'DashPresent': 1, 'DashNotPresent': 0})
df["Url_Prop7"] = df["Url_Prop7"].map({'2dots': -1, '3dots': 0, 'MoreThan3dots': 1})
df["Url_Prop8"] = df["Url_Prop8"].map({'HTTPSNotAdded': 1, 'HTTPSAdded': 0})
df["Phis_Probability"] = df["Phis_Probability"].map({'Medium': 1, 'Low': 0, 'High': 2})

data = df[["Website_Char1", "Website_Char2", "Website_Char3", "Website_Char4",
           "Website_Char5", "Url_Prop1", "Url_Prop2", "Url_Prop3", "Url_Prop4",
           "Url_Prop5", "Url_Prop6", "Url_Prop7", "Url_Prop8",
           "Phis_Probability"]].to_numpy()

inputs = data[:, :-1]
outputs = data[:, -1]
training_inputs = inputs[:1800]
training_outputs = outputs[:1800]
testing_inputs = inputs[1800:]
testing_outputs = outputs[1800:]

classifier = MLPClassifier()
classifier.fit(training_inputs, training_outputs)
predictions = classifier.predict(testing_inputs)
accuracy = 100.0 * accuracy_score(testing_outputs, predictions)
print("The accuracy of MLPC on testing data is: " + str(accuracy))

testSet = [[1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('MLPC Prediction on the first test set is:', predictions)

testSet = [[1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('MLPC Prediction on the second test set is:', predictions)

testSet = [[0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('MLPC Prediction on the third test set is:', predictions)

print("Note: 0 indicates Low probable Phishing Site, 1 indicates Medium "
      "probable Phishing Site, 2 indicates High probable Phishing Site")

Dt1.py
from sklearn import tree
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

df = pd.read_csv('phisset.csv')

# The same integer encoding of the categorical features as in mlpc1.py.
df["Website_Char1"] = df["Website_Char1"].map({'MysteriousLinks': 1, 'NoMysteriousLinks': 0})
df["Website_Char2"] = df["Website_Char2"].map({'NonWorkingLinks': 1, 'WorkingLinks': 0})
df["Website_Char3"] = df["Website_Char3"].map({'LessThanOneYear': 1, 'MoreThanOneYear': 0})
df["Website_Char4"] = df["Website_Char4"].map({'DiffDomainFavicon': 1, 'SameDomainFavicon': 0})
df["Website_Char5"] = df["Website_Char5"].map({'SlowDownPerf': 1, 'NoSlowDownPerf': 0})
df["Url_Prop1"] = df["Url_Prop1"].map({'LessThan54': -1, '54to75': 0, 'MoreThan75': 1})
df["Url_Prop2"] = df["Url_Prop2"].map({'DifferentPositionedSlash': 1, 'CorrectPositionedSlash': 0})
df["Url_Prop3"] = df["Url_Prop3"].map({'IPAddressPart': 1, 'NoIPAddressPart': 0})
df["Url_Prop4"] = df["Url_Prop4"].map({'ShortendURL': 1, 'NoShortendURL': 0})
df["Url_Prop5"] = df["Url_Prop5"].map({'@symbol': 1, 'No@symbol': 0})
df["Url_Prop6"] = df["Url_Prop6"].map({'DashPresent': 1, 'DashNotPresent': 0})
df["Url_Prop7"] = df["Url_Prop7"].map({'2dots': -1, '3dots': 0, 'MoreThan3dots': 1})
df["Url_Prop8"] = df["Url_Prop8"].map({'HTTPSNotAdded': 1, 'HTTPSAdded': 0})
df["Phis_Probability"] = df["Phis_Probability"].map({'Medium': 1, 'Low': 0, 'High': 2})

data = df[["Website_Char1", "Website_Char2", "Website_Char3", "Website_Char4",
           "Website_Char5", "Url_Prop1", "Url_Prop2", "Url_Prop3", "Url_Prop4",
           "Url_Prop5", "Url_Prop6", "Url_Prop7", "Url_Prop8",
           "Phis_Probability"]].to_numpy()

inputs = data[:, :-1]
outputs = data[:, -1]
training_inputs = inputs[:1800]
training_outputs = outputs[:1800]
testing_inputs = inputs[1800:]
testing_outputs = outputs[1800:]

classifier = tree.DecisionTreeClassifier()
classifier.fit(training_inputs, training_outputs)
predictions = classifier.predict(testing_inputs)
accuracy = 100.0 * accuracy_score(testing_outputs, predictions)
print("The accuracy of Decision Tree on testing data is: " + str(accuracy))

testSet = [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('DT Prediction on the first test set is:', predictions)

testSet = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('DT Prediction on the second test set is:', predictions)

testSet = [[0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('DT Prediction on the third test set is:', predictions)

print("Note: 0 indicates Low probable Phishing Site, 1 indicates Medium "
      "probable Phishing Site, 2 indicates High probable Phishing Site")

Streamlit Code:

import streamlit as st
from sklearn.neural_network import MLPClassifier
import numpy
import pandas as pd
from sklearn.metrics import accuracy_score

pd.options.mode.chained_assignment = None # default='warn'

df = pd.read_csv('phisset.csv')

df["Website_Char1"] = df["Website_Char1"].map({'MysteriousLinks': 1,
'NoMysteriousLinks': 0})
df["Website_Char2"] = df["Website_Char2"].map({'NonWorkingLinks': 1,
'WorkingLinks': 0})
df["Website_Char3"] = df["Website_Char3"].map({'LessThanOneYear': 1,
'MoreThanOneYear': 0})
df["Website_Char4"] = df["Website_Char4"].map({'DiffDomainFavicon': 1,
'SameDomainFavicon': 0})
df["Website_Char5"] = df["Website_Char5"].map({'SlowDownPerf': 1,
'NoSlowDownPerf': 0})
df["Url_Prop1"] = df["Url_Prop1"].map({'LessThan54': -1, '54to75': 0,
'MoreThan75': 1})
df["Url_Prop2"] = df["Url_Prop2"].map({'DifferentPositionedSlash': 1,
'CorrectPositionedSlash': 0})
df["Url_Prop3"] = df["Url_Prop3"].map({'IPAddressPart': 1,
'NoIPAddressPart': 0})
df["Url_Prop4"] = df["Url_Prop4"].map({'ShortendURL': 1, 'NoShortendURL':
0})
df["Url_Prop5"] = df["Url_Prop5"].map({'@symbol': 1, 'No@symbol': 0})
df["Url_Prop6"] = df["Url_Prop6"].map({'DashPresent': 1, 'DashNotPresent':
0})
df["Url_Prop7"] = df["Url_Prop7"].map({'2dots': -1, '3dots': 0,
'MoreThan3dots': 1})
df["Url_Prop8"] = df["Url_Prop8"].map({'HTTPSNotAdded': 1, 'HTTPSAdded':
0})
df["Phis_Probability"] = df["Phis_Probability"].map({'Medium': 1, 'Low': 0,
'High': 2})

data = df[
["Website_Char1", "Website_Char2", "Website_Char3", "Website_Char4",
"Website_Char5", "Url_Prop1", "Url_Prop2",
"Url_Prop3",
"Url_Prop4", "Url_Prop5", "Url_Prop6", "Url_Prop7", "Url_Prop8",
"Phis_Probability"]].to_numpy()

st.title('Detection of Phishing Websites')

web_char_1 = st.number_input("Website containing any mysterious links :
Enter 0 for No 1 for Yes", 0, 1, 0)
web_char_2 = st.number_input("Website containing any Non-Working links :
Enter 0 for No 1 for Yes", 0, 1, 0)
web_char_3 = st.number_input("Website existing for less than one year :
Enter 0 for No 1 for Yes", 0, 1, 0)
web_char_4 = st.number_input("Favicon loaded from different domain : Enter
0 for No 1 for Yes", 0, 1, 0)
web_char_5 = st.number_input("Website slows down the performance of the
explorer: Enter 0 for No 1 for Yes", 0, 1, 0)
URL_prop_1 = st.number_input("URL length : Enter -1 if length <54; 0 if
54<length<75 ; 1 if length>75", -1, 1, 0)
URL_prop_2 = st.number_input("URL containing double slash at different
Position: Enter 0 for No 1 for Yes", 0, 1, 0)
URL_prop_3 = st.number_input("IP address is used as a part of URL : Enter 0
for No 1 for Yes", 0, 1, 0)
URL_prop_4 = st.number_input("Shortened URL : Enter 0 for No 1 for Yes", 0,
1, 0)
URL_prop_5 = st.number_input("URL having @ symbol : Enter 0 for No 1 for
Yes",0,1,0)
URL_prop_6 = st.number_input("Dash symbol is present in URL : Enter 0 for
No 1 for Yes", 0, 1, 0)
URL_prop_7 = st.number_input("No. of dots in the URL : Enter -1 for 3
dots ; 0 for 3 dots ; 1 for more than 3", 0,
1, 0)
URL_prop_8 = st.number_input("HTTPS is not added on the domain part of the
URL : Enter 0 for No 1 for Yes", 0, 1, 0)

inputs = data[:, :-1]


outputs = data[:, -1]
training_inputs = inputs[:1800]
training_outputs = outputs[:1800]
testing_inputs = inputs[1800:]
testing_outputs = outputs[1800:]

classifier = MLPClassifier()
classifier.fit(training_inputs, training_outputs)  # fit the classifier to the training data

predictions = classifier.predict(testing_inputs)

accuracy = 100.0 * accuracy_score(testing_outputs, predictions)

print("The accuracy of MLPC on testing data is: " + str(accuracy))

# testSet = [[1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1]]


# test = pd.DataFrame(testSet)
# predictions = classifier.predict(test)
# print('MLPC Prediction on the first test set is:', predictions)
# testSet = [[1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0]]
# test = pd.DataFrame(testSet)
# predictions = classifier.predict(test)
# print('MLPC Prediction on the second test set is:', predictions)
# testSet = [[0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]]
# test = pd.DataFrame(testSet)
# predictions = classifier.predict(test)
# print('MLPC Prediction on the third test set is:', predictions)

if st.button("Detect the level of phishing"):
    testdata = [web_char_1, web_char_2, web_char_3, web_char_4, web_char_5,
                URL_prop_1, URL_prop_2, URL_prop_3, URL_prop_4,
                URL_prop_5, URL_prop_6, URL_prop_7, URL_prop_8]
    test_phish = pd.DataFrame([testdata])
    predict = classifier.predict(test_phish)
    if predict[0] == 0:
        st.subheader('As per the input given, the MLPC classifier classifies the website as a low-probability phishing website')
    elif predict[0] == 1:
        st.subheader('As per the input given, the MLPC classifier classifies the website as a medium-probability phishing website')
    elif predict[0] == 2:
        st.subheader('As per the input given, the MLPC classifier classifies the website as a high-probability phishing website')
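Assuming the listing above is saved as phisdet1.py (the script name used in the results of Section 7.3), the interface can be launched locally with the command streamlit run phisdet1.py, which serves the input form shown in the screenshots below.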

7.3 RESULTS
phisdet1.py

Fig 7.3.1 Result of phisdet1.py


Urlprop.py

Fig 7.3.2 URL properties Window

Sitechar1.py

Fig 7.3.3 Website characteristics Window


datasubset1.py

Fig 7.3.4 Graph on Mysterious links vs Phishing probability

Mlpc1.py and Dt1.py

Fig 7.3.5 Results using Multilayer perceptron and Decision tree

Streamlit results:

CHAPTER-8

TESTING

8.1 INTRODUCTION TO TESTING

SOFTWARE TESTING is defined as an activity to check whether the actual results
match the expected results and to ensure that the software system is defect-free. It
involves the execution of a software component or system component to evaluate one
or more properties of interest, and it is required for evaluating the system. This phase
is a critical phase of software quality assurance and presents the ultimate review of
coding.

Importance of Testing

Software testing is an indispensable part of development. This process is often
skipped, and as a result the product and the business may suffer. To understand the
importance of testing, here are some of the key points:
➔ Software testing saves money
➔ Provides security
➔ Improves product quality
➔ Customer satisfaction
Testing can be done in different ways. The main idea behind testing is to reduce
errors with minimum time and effort.

Benefits of Testing

Cost-Effective: This is one of the important advantages of software testing. Testing
any IT project on time helps you save money in the long term; a bug caught in an
earlier stage of software testing costs less to fix.

Security: This is the most sensitive benefit of software testing. People look for
trusted products, and testing helps in removing risks and problems earlier.

Product Quality: Quality is an essential requirement of any software product. Testing
ensures a quality product is delivered to customers.

Customer Satisfaction: The main aim of any product is to satisfy its customers, and
UI/UX testing ensures the best user experience.

Different types of Testing:

Unit Testing: Unit tests are very low level, close to the source of your application.
They consist of testing individual methods and functions of the classes, components,
or modules used by your software. Unit tests are in general quite cheap to automate
and can be run very quickly by a continuous integration server. A sketch of such a
test for this project follows.
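As an illustration, the categorical encoding used throughout this project can be unit
tested in isolation. The helper function below is hypothetical (in mlpc.py the mapping
is written inline) and is shown only to give the encoding step a testable seam; the test
runs under pytest.

import pandas as pd

# Hypothetical helper: the Url_Prop8 mapping from mlpc.py, factored out
# so the encoding step can be tested on its own.
def encode_url_prop8(df):
    df["Url_Prop8"] = df["Url_Prop8"].map({'HTTPSNotAdded': 1, 'HTTPSAdded': 0})
    return df

def test_encode_url_prop8():
    df = pd.DataFrame({"Url_Prop8": ['HTTPSNotAdded', 'HTTPSAdded']})
    encoded = encode_url_prop8(df)
    # Each label must map to the numeric code the classifier expects.
    assert encoded["Url_Prop8"].tolist() == [1, 0]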

Integration Testing: Integration tests verify that different modules or services used
by your application work well together. For example, this can mean testing the
interaction with the database or making sure that microservices work together as
expected. These types of tests are more expensive to run, as they require multiple
parts of the application to be up and running. A small sketch in the context of this
project follows.
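A minimal sketch, assuming the 13-feature, -1/0/1 coding produced by the
preprocessing stage: it checks that encoded features and the scikit-learn classifier
work together, using a tiny synthetic matrix rather than the real data set.

import numpy as np
from sklearn.neural_network import MLPClassifier

def test_encoding_and_classifier_work_together():
    # Three synthetic rows in the same coding the preprocessing stage
    # emits, with one label from each class.
    X = np.array([[0] * 13, [1] * 13, [0, 1] * 6 + [0]])
    y = np.array([0, 2, 1])
    clf = MLPClassifier(max_iter=500)
    clf.fit(X, y)
    preds = clf.predict(X)
    # One prediction per row, drawn from the trained label set.
    assert preds.shape == (3,) and set(preds).issubset({0, 1, 2})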

Functional Tests: Functional tests focus on the business requirements of an
application. They only verify the output of an action and do not check the
intermediate states of the system when performing that action. There is sometimes
confusion between integration tests and functional tests, as they both require multiple
components to interact with each other. The difference is that an integration test may
simply verify that you can query the database, while a functional test would expect to
get a specific value from the database, as defined by the product requirements.
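In this project, a functional test would pin a specific expected label. The sketch below
assumes a pytest fixture named classifier (not shown) that returns the model trained
as in mlpc.py; the expected output for the all-zeros input is taken from the test cases
recorded in Section 8.3.

import pandas as pd

def test_all_safe_features_classified_low(classifier):
    # Business requirement: an input with no phishing indicators must be
    # classified as a low-probability phishing site (label 0), per the
    # expected outputs recorded in Section 8.3.
    test = pd.DataFrame([[0] * 13])
    assert classifier.predict(test)[0] == 0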

Regression Testing: Regression testing is a crucial stage for the product and is very
useful for developers to identify the stability of the product under changing
requirements. Regression testing is done to verify that a code change in the software
does not impact the existing functionality of the product.
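A regression guard for this project could pin a floor under the held-out accuracy, so
that a later change which silently degrades the model fails the suite. The sketch
assumes pytest fixtures classifier, testing_inputs, and testing_outputs built as in
mlpc.py; the 80% floor is an assumed threshold, not a measured project result.

from sklearn.metrics import accuracy_score

def test_accuracy_does_not_regress(classifier, testing_inputs, testing_outputs):
    # Fail the build if held-out accuracy drops below the pinned floor.
    preds = classifier.predict(testing_inputs)
    assert accuracy_score(testing_outputs, preds) >= 0.80  # assumed threshold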

50
System Testing: System testing of software or hardware is testing conducted on a
complete, integrated system to evaluate its compliance with its specified
requirements. System testing is a series of different tests whose primary purpose is to
fully exercise the computer-based system.

Performance Testing: It checks the speed, response time, reliability, resource usage,
and scalability of a software program under the expected workload. The purpose of
performance testing is not to find functional defects but to eliminate performance
bottlenecks in the software or device.

Alpha Testing: This is a form of internal acceptance testing performed mainly by the
in-house software QA and testing teams. Alpha testing is the last testing done by the
test teams at the development site, after acceptance testing and before releasing the
software for the beta test. It can also be done by potential users or customers of the
application, but it is still a form of in-house acceptance testing.

Beta Testing: This is the testing stage that follows the full internal alpha test cycle. It
is the final testing phase, in which companies release the software to a few external
user groups outside the company's test teams and employees. This initial software
version is known as the beta version. Most companies gather user feedback on this
release.

Black Box Testing: Also known as behavioural testing, this is a software testing
method in which the internal structure, design, or implementation of the item being
tested is not known to the tester. These tests can be functional or non-functional,
though usually functional.

Fig 8.1.1 Black box testing

This method is named so because the software program, in the eyes of the tester, is
like a black box, inside which one cannot see. This method attempts to find errors in
the following categories:

➢ Incorrect or missing functions
➢ Interface errors
➢ Errors in data structures or external database access
➢ Behaviour or performance errors
➢ Initialization or termination errors

A sketch of a black-box check for this project follows the list.
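A minimal black-box sketch: without any knowledge of the model internals, it feeds a
sample of valid feature vectors and checks only the observable output. It assumes the
same pytest fixture classifier as in the earlier sketches.

import itertools
import pandas as pd

def test_predictions_stay_in_label_set(classifier):
    # Treat the model as a black box: for a sample of valid 13-feature
    # inputs, the only observable contract is that the output label is
    # one of the three defined classes.
    for bits in itertools.islice(itertools.product([0, 1], repeat=13), 50):
        pred = classifier.predict(pd.DataFrame([list(bits)]))[0]
        assert pred in {0, 1, 2}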

White Box Testing: White box testing (also known as clear box testing, open box
testing, glass box testing, transparent box testing, code-based testing, or structural
testing) is a software testing method in which the internal structure, design, or
implementation of the item being tested is known to the tester. The tester chooses
inputs to exercise paths through the code and determines the appropriate outputs.
Programming know-how and implementation knowledge are essential. White box
testing goes beyond the user interface and into the nitty-gritty of a system. This
method is named so because the software program, in the eyes of the tester, is like a
white/transparent box, into which one can clearly see.

Fig 8.1.2 White box testing
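A white-box sketch that relies on internal knowledge of the fitted model: with
scikit-learn's default single hidden layer of 100 units, the first weight matrix of the
MLPClassifier should map the project's 13 input features to those units. The
classifier fixture is assumed as above.

def test_network_layer_shapes(classifier):
    # White-box assertion: inspects the fitted model's internal weight
    # matrices. scikit-learn's MLPClassifier defaults to one hidden
    # layer of 100 units, so coefs_[0] maps 13 features -> 100 units.
    assert classifier.coefs_[0].shape == (13, 100)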

Multi Layer Perceptron Testing: This is the testing performed after modification of a
system, component, or group of related units, to ensure that the modification works
correctly and does not damage other modules or cause them to produce unexpected
results. It falls under the class of black box testing.

8.2 TEST CODE

mlpc.py

import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv('phisset.csv')

df["Website_Char1"] = df["Website_Char1"].map({'MysteriousLinks': 1, 'NoMysteriousLinks': 0})
df["Website_Char2"] = df["Website_Char2"].map({'NonWorkingLinks': 1, 'WorkingLinks': 0})
df["Website_Char3"] = df["Website_Char3"].map({'LessThanOneYear': 1, 'MoreThanOneYear': 0})
df["Website_Char4"] = df["Website_Char4"].map({'DiffDomainFavicon': 1, 'SameDomainFavicon': 0})
df["Website_Char5"] = df["Website_Char5"].map({'SlowDownPerf': 1, 'NoSlowDownPerf': 0})
df["Url_Prop1"] = df["Url_Prop1"].map({'LessThan54': -1, '54to75': 0, 'MoreThan75': 1})
df["Url_Prop2"] = df["Url_Prop2"].map({'DifferentPositionedSlash': 1, 'CorrectPositionedSlash': 0})
df["Url_Prop3"] = df["Url_Prop3"].map({'IPAddressPart': 1, 'NoIPAddressPart': 0})
df["Url_Prop4"] = df["Url_Prop4"].map({'ShortendURL': 1, 'NoShortendURL': 0})
df["Url_Prop5"] = df["Url_Prop5"].map({'@symbol': 1, 'No@symbol': 0})
df["Url_Prop6"] = df["Url_Prop6"].map({'DashPresent': 1, 'DashNotPresent': 0})
df["Url_Prop7"] = df["Url_Prop7"].map({'2dots': -1, '3dots': 0, 'MoreThan3dots': 1})
df["Url_Prop8"] = df["Url_Prop8"].map({'HTTPSNotAdded': 1, 'HTTPSAdded': 0})
df["Phis_Probability"] = df["Phis_Probability"].map({'Medium': 1, 'Low': 0, 'High': 2})

data = df[["Website_Char1", "Website_Char2", "Website_Char3", "Website_Char4",
           "Website_Char5", "Url_Prop1", "Url_Prop2", "Url_Prop3", "Url_Prop4",
           "Url_Prop5", "Url_Prop6", "Url_Prop7", "Url_Prop8",
           "Phis_Probability"]].to_numpy()

inputs = data[:, :-1]
outputs = data[:, -1]
training_inputs = inputs[:1800]
training_outputs = outputs[:1800]
testing_inputs = inputs[1800:]
testing_outputs = outputs[1800:]

classifier = MLPClassifier()
classifier.fit(training_inputs, training_outputs)
predictions = classifier.predict(testing_inputs)
accuracy = 100.0 * accuracy_score(testing_outputs, predictions)
print("The accuracy of MLPC on testing data is: " + str(accuracy))

testSet = [[1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('MLPC Prediction on the first test set is:', predictions)

testSet = [[1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('MLPC Prediction on the second test set is:', predictions)

testSet = [[0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]]
test = pd.DataFrame(testSet)
predictions = classifier.predict(test)
print('MLPC Prediction on the third test set is:', predictions)

print("Note: 0 indicates a low probable phishing site, 1 indicates a medium probable phishing site, 2 indicates a high probable phishing site")

8.3 TEST CASE

MLPC CLASSIFICATION
[0] - Low probable phishing site
[1] - Medium probable phishing site
[2] - High probable phishing site

INPUT (testSet)                      PREDICTED OUTPUT   ACTUAL OUTPUT
[[1,1,1,1,1,1,1,1,1,1,1,1,1]]        [2]                [2]
[[0,0,0,0,0,0,0,0,0,0,0,0,0]]        [0]                [0]
[[0,1,0,1,0,1,0,1,0,1,0,1,0]]        [1]                [1]
[[0,1,1,1,0,1,0,1,0,1,0,1,1]]        [1]                [1]
[[1,1,0,1,0,1,0,1,1,1,0,0,0]]        [1]                [1]
[[1,1,1,1,1,1,0,0,1,1,1,1,1]]        [2]                [2]

(Each output cell reads "MLPC Prediction on the test set is: [label]".)
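A minimal sketch that replays these tabulated cases against the trained model; it
assumes mlpc.py has already been run in the same session, so that classifier is the
fitted MLPClassifier.

import pandas as pd

cases = [
    ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 2),
    ([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 0),
    ([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0], 1),
    ([0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1], 1),
    ([1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0], 1),
    ([1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1], 2),
]
for features, expected in cases:
    # Compare the live prediction with the expected label from the table.
    predicted = classifier.predict(pd.DataFrame([features]))[0]
    print(features, "predicted:", predicted, "expected:", expected)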

CHAPTER-9

CONCLUSION
Phishing targets naive online users, tricking them into revealing sensitive information
such as usernames, passwords, social security numbers, or credit card numbers.
Attackers fool Internet users by masking a webpage as a trustworthy or legitimate
page to retrieve personal information. Many anti-phishing solutions, such as blacklist
or whitelist, heuristic, and visual-similarity-based methods, have been proposed to
date, but online users are still being trapped into revealing sensitive information on
phishing websites.

This project, entitled "Detection of Phishing Websites using Deep Learning
Algorithms", classifies the phishing-website data set using decision tree classification
and Multi Layer Perceptron classification. The project helps newcomers to social
media protect themselves against phishing URLs, and ultimately contributes to
improving the quality of the World Wide Web.

CHAPTER-10

FUTURE SCOPE

Phishing may never go out of season, but with the right approach one can minimise
the risk that an organisation will ever get hooked. This prototype has great potential
for further improvement. As of now, the project has been tested using the data set
provided by Proofpoint, containing data generated in the USA. Its applicability to
similar data sets containing data from other countries needs to be explored.

CHAPTER-11

REFERENCES

[1] S. Aarthi, Narsepalli Vamsi Kishan, V. Surya Teja, N. V. Harsha Vardhan Gupta:
Classification of Phishing Website Based on URL Features. International Journal of
Emerging Technologies in Engineering Research, Vol. 7, Issue 5.

[2] Aburrous M, Hossain MA, Dahal K, Thabtah F: Intelligent phishing detection
system for e-banking using fuzzy data mining. Expert Systems with Applications.

[3] M. Arunkrishna, B. Mukunthan: A Multi-Classifier Approach for Twitter Spam
Detection Using Innovative ANN-FDT Algorithm. e-ISSN: 0976-5166.

[4] https://nevonprojects.com/detecting-phishing-websites-using-machine-learning/

[5] https://ieeexplore.ieee.org/document/8769571

