
A

Summer Internship Report


On
Naïve Bayes - Spam Classifier
(CE446 – Summer Internship - II)

Prepared by
Jay Desai (17CE025),
Vishvesh Khandpur (17CE043),
Dharan Thaker (17CE126)

Under the Supervision of


Prof. Minal Shah

Submitted to
Charotar University of Science & Technology (CHARUSAT)
for the Partial Fulfillment of the Requirements for the
Degree of Bachelor of Technology (B.Tech.)
for Semester 7

Submitted at

Accredited with Grade A by NAAC


Accredited with Grade A by KCG

U & P U. PATEL DEPARTMENT OF COMPUTER ENGINEERING


(NBA Accredited)
Chandubhai S. Patel Institute of Technology (CSPIT)
Faculty of Technology & Engineering (FTE), CHARUSAT
At: Changa, Dist: Anand, Pin: 388421.
2020

CERTIFICATE

This is to certify that the report entitled "Naïve Bayes - Spam Classifier" is a bonafide
work carried out by Jay Desai (17CE025), Vishvesh Khandpur (17CE043), and Dharan
Thaker (17CE126) under the guidance and supervision of Prof. Minal Shah for the
subject Summer Internship - II (CE446) of the 7th Semester of Bachelor of Technology
in Computer Engineering at Chandubhai S. Patel Institute of Technology (CSPIT),
Faculty of Technology & Engineering (FTE) - CHARUSAT, Gujarat.

To the best of my knowledge and belief, this work embodies the work of the candidates
themselves, has been duly completed, fulfills the requirements of the ordinance
relating to the B.Tech. degree of the University, and is up to the standard in respect of
content, presentation, and language for being referred to the examiner(s).

Under the supervision of,

Prof. Minal Shah


Assistant Professor
U & P U. Patel Dept. of Computer Engineering
CSPIT, FTE, CHARUSAT, Changa, Gujarat

Dr. Ritesh Patel


Head - U & P U. Patel Department of Computer Engineering,
CSPIT, FTE, CHARUSAT, Changa, Gujarat.

Chandubhai S. Patel Institute of Technology (CSPIT)


Faculty of Technology & Engineering (FTE), CHARUSAT
At: Changa, Ta. Petlad, Dist. Anand, Pin: 388421. Gujarat.
Acknowledgement

This report has been prepared for the internship undertaken as a virtual internship
provided by CHARUSAT, in order to study the practical aspects of the course
and the implementation of theory in the real field. The aim of the internship is to become
familiar with the practical application of theoretical knowledge and to clarify career goals.
We have successfully completed the internship and compiled this report as a summary of the
conclusions drawn from the internship experience. We would like to express
our sincere gratitude to our internship coordinator, Prof. Minal Shah, who gave her
valuable time and gave us a chance to learn despite her busy schedule. We
are also thankful to Dr. Ritesh Patel (Head of Department) and the other staff members for
their cooperative support, and for presenting us with an opportunity to gain practical
experience in this organization. Lastly, we would like to thank Mr. Ronak Patel for providing
proper guidance whenever we felt confused during the internship. The time spent in this
virtual internship was challenging and supportive of our careers; we have gained valuable
work experience that will definitely help us make a favorable impression on prospective
future employers.

Abstract

This report presents the tasks completed during the six-week summer internship conducted
as a virtual internship by CHARUSAT under the project "Naïve Bayes - Spam Classifier".

There has been immense exponential growth in spam email traffic, which has led to
the development of highly advanced and sophisticated email spam-ham detection
software and filters. Almost all the email service providers in the market have embedded these
filters in their email software and are constantly working on making their spam
detection algorithms more and more robust. Various efficient machine learning algorithms
and neural networks form the backbone of the production and development of these anti-spam
filters. Popular and highly effective approaches among them include the Naïve Bayes
algorithm, the approach based on the TF-IDF model, Artificial Neural Networks, the k-Nearest
Neighbor approach, and Support Vector Machines. In this report we present the analysis,
description, and comparison of such algorithms and approaches, in order to find the best and
most suitable approach for making email spam detection simpler and more effective.


Table of Contents

Acknowledgement
Abstract
Company Certificate
Description of Company / Organization
Chapter 1  Introduction
    1.1 Project Overview
    1.2 Purpose
    1.3 Scope
    1.4 Objectives
Chapter 2  Naïve Methods
    2.1 TF-IDF Method
    2.2 MultinomialNB() function
Chapter 3  Techniques of NN
    3.1 ANN Technique
    3.2 RNN Technique
    3.3 Understanding of existing system
Chapter 4  LR and SVM
    4.1 LR-SVM
    4.2 GUI
Chapter 5  Result Analysis
    5.1 Result
    5.2 Analysis
Conclusion
References


List of Figures

Fig 4.1 Home Page
Fig 4.2 Input Message
Fig 4.3 Result of Input
Fig 4.4 Hovering over Result Circle


List of Tables

Table 5.1 TF-IDF Result Table
Table 5.2 MultinomialNB() Result Table
Table 5.3 ANN Result Table
Table 5.4 RNN Result Table
Table 5.5 LR-SVM Result Table
Table 5.6 Analysis Table


Description of Company

Company name: Charusat

Company website: https://www.charusat.ac.in/

Company address (branch where you want to go for internship): Off. Nadiad-Petlad

Highway, Changa 388 421, Anand, Gujarat

Number of employees: 550

Number of branches and branch locations (if any): N/A

Head office address (in case of multiple branches): Off. Nadiad-Petlad Highway, Changa

388 421, Anand, Gujarat

Contact person name: Prof. Minal Shah

Contact person phone number: 9408757507

Contact person email Id: minalmaniar.ce@charusat.ac.in

HR name: N/A

HR phone number: N/A

HR email Id: N/A

Technology (company working on): Education

Current project (if details provided): Naïve Bayes Spam Classifier

How did you get this company?: We are students of this institute

Reason to choose this company: On the guidance of the TNP cell (Prof. Ronak Patel)


Chapter – 1 INTRODUCTION

1.1 PROJECT OVERVIEW

The report discusses "Email SPAM-HAM detection techniques", which are used to
identify and avoid the SPAM emails we receive almost every day. This in turn leaves
us with only the filtered, legitimate emails that are crucial to us, and keeps us safe from
online frauds and attacks. Detection is achieved using various Machine Learning and NLP
techniques such as the TF-IDF model, the ANN model, the Naïve Bayes algorithm, the RNN
model, and SVM. We have incorporated almost every feasible method for the task and have
also compared and contrasted these methods by their accuracy and precision scores.

1.2 PURPOSE

The purpose of this internship was to explore and learn the various Machine Learning models
in order to make and deploy a real-life sustainable Email SPAM-HAM detection model.

1.3 SCOPE

The scope of Email SPAM-HAM detection includes:

• Exploring various Machine Learning and Natural Language Processing
methodologies and algorithms for research purposes.
• Creating a full-fledged Email SPAM-HAM detector that can detect SPAM emails
with the utmost accuracy.
• Learning how to deploy a Machine Learning model on the internet.
• Performing a comparative analysis and selecting the best algorithm out of all the
algorithms considered.


1.4 OBJECTIVES
The objectives behind the Email SPAM-HAM detection are:

• To build a holistic and fully functional email spam detector that can classify
SPAM emails.
• To research and comparatively analyze the available techniques and algorithms
through which spam detection can be made possible.
• To identify the most accurate and precise algorithm that generates the desired
results.
• To deploy the project on the internet.


Chapter – 2 Naïve Methods

2.1 TF-IDF Method

In the "Bag-of-Words" (BoW) method, words are represented as a vector: a word gets a
score of '1' if it appears in a document and a score of '0' if it does not, so the vector becomes
a sequence of 1's and 0's. The BoW method gives no special importance to any specific word
in the text; the words that appear most often simply get weighted as "important".

In the TF-IDF method, words are also represented as a vector, but instead of only 0's and
1's, each word gets a score with a specific value between 0.0 and 1.0, computed with the
equations given below. This technique does not treat every word equally like the previous
one, but works on word frequencies followed by certain equations.

TF (Term Frequency) is the count of the number of times a word occurs in the given
document. IDF (Inverse Document Frequency) is computed from the number of documents
in the corpus in which the word is present.

Steps:

1. Import all necessary modules.
2. Load the dataset with the specific encoding on which all further processing is done.
3. Split the data into training and testing sets by providing a ratio between 0 and 1;
normally 0.75 is provided, meaning 75% of the whole dataset is used for training and
the remaining 25% is used for testing.
4. Visualize the dataset for better understanding, mainly as a "Word Cloud"
(optional; not needed if the user does not require it).
5. Train the model:
   a. Pre-processing:
      - Convert to lowercase
      - Tokenize strings to create tokens
      - Apply a stemming algorithm (Porter Stemmer)
      - Remove stop words
      - For improving accuracy, 'n-grams' are used: for example, the word "bumper"
        alone is not spam but "bumper prize" is, so combining two words yields
        more accuracy.

Now, two techniques are used here for training the model, as below:


1) Bag of Words and 2) TF-IDF

The first one is "Bag of Words", describing the probability of a word in terms of its term
frequency, overall and with respect to "spam" and "ham", as below:

$$P(w) = \frac{\text{Total number of occurrences of } w \text{ in the dataset}}{\text{Total number of words in the dataset}}$$

$$P(w \mid spam) = \frac{\text{Total number of occurrences of } w \text{ in spam messages}}{\text{Total number of words in spam messages}}$$

$$P(w \mid ham) = \frac{\text{Total number of occurrences of } w \text{ in ham messages}}{\text{Total number of words in ham messages}}$$

The second one is "Term Frequency - Inverse Document Frequency", with the relation
expressed through a logarithm as below:

$$IDF(w) = \log \frac{\text{Total number of messages}}{\text{Total number of messages containing } w}$$

$$P(w) = \frac{TF(w) \cdot IDF(w)}{\sum_{x \,\in\, \text{train dataset}} TF(x) \cdot IDF(x)}$$

$$P(w \mid spam) = \frac{TF(w \mid spam) \cdot IDF(w)}{\sum_{x \,\in\, \text{train dataset}} TF(x \mid spam) \cdot IDF(x)}$$
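As a quick worked example, under assumed toy numbers: if the training set contains 100
messages and the word "prize" occurs in 5 of them, then

$$IDF(\text{prize}) = \log \frac{100}{5} = \log 20 \approx 3.0$$

(using the natural logarithm), so rare words receive a larger IDF weight than words that
occur in almost every message.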

   b. Smoothing
      - There is a possibility that a word arrives at prediction time that is not part of
        the training dataset; at that point P(w) becomes 0, which makes
        P(spam|w) undefined, and this should never happen. So a small number,
        usually called "alpha" and selected by the user, is added in such cases,
        keeping every probability finite. It is used in both methods as below:
      - For "Bag of Words":

$$P(w \mid spam) = \frac{TF(w \mid spam) + \alpha}{\sum_{x \,\in\, \text{train dataset}} TF(x) + \alpha \cdot \sum_{x \,\in\, \text{spam in train dataset}} 1}$$

      - For "TF-IDF":

$$P(w \mid spam) = \frac{TF(w \mid spam) \cdot IDF(w) + \alpha}{\sum_{x \,\in\, \text{train dataset}} TF(x) \cdot IDF(x) + \alpha \cdot \sum_{x \,\in\, \text{spam in train dataset}} 1}$$


6. Classification
   - Find P(w|spam); if w is not from the train dataset then TF(w) is 0 (handled by
     smoothing). The product of the P(w|spam) terms with P(spam) gives us
     P(spam|message).
   - The same process is used for finding P(ham|message).
   - Whichever probability is greater decides the corresponding tag assigned to
     the message, "spam" or "ham". A sketch of the whole procedure follows.
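To make the pipeline concrete, here is a minimal sketch of the smoothed Bag-of-Words /
TF-IDF Naïve Bayes classifier described above. It is a sketch under assumptions, not the
report's actual code: the tokenizer is deliberately crude (the stemming, stop-word removal,
and n-gram steps of 5a are omitted), and the alpha value and sample messages are illustrative.

import math
import re
from collections import Counter

def tokenize(text):
    # lowercase the text and split it into word tokens
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesTfidf:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant, chosen by the user

    def fit(self, messages, labels):  # labels are "spam" / "ham"
        self.tf = {"spam": Counter(), "ham": Counter()}
        docs_with = Counter()
        self.prior = {c: labels.count(c) / len(labels) for c in ("spam", "ham")}
        for text, label in zip(messages, labels):
            words = tokenize(text)
            self.tf[label].update(words)  # per-class term frequencies
            docs_with.update(set(words))  # document frequency per word
        # IDF(w) = log(total messages / messages containing w)
        self.idf = {w: math.log(len(messages) / n) for w, n in docs_with.items()}
        # per-class denominator: sum of TF*IDF plus alpha * class vocabulary size
        self.denom = {
            c: sum(self.tf[c][w] * self.idf[w] for w in self.tf[c])
               + self.alpha * len(self.tf[c])
            for c in ("spam", "ham")
        }

    def _log_prob(self, words, c):
        # log P(c) + sum of log P(w|c); smoothing keeps unseen words finite
        lp = math.log(self.prior[c])
        for w in words:
            num = self.tf[c][w] * self.idf.get(w, 0.0) + self.alpha
            lp += math.log(num / self.denom[c])
        return lp

    def predict(self, text):
        words = tokenize(text)
        return max(("spam", "ham"), key=lambda c: self._log_prob(words, c))

clf = NaiveBayesTfidf(alpha=1.0)
clf.fit(["win a bumper prize now", "are we meeting today"], ["spam", "ham"])
print(clf.predict("bumper prize waiting"))  # expected: spam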
2.2 MultinomialNB() function

Among the most popular and widely used algorithms in text classification are those based
on the Bayesian equation. The Naïve Bayes algorithm was proposed for spam detection in
1998. Here, the magic is all about the probabilities of occurrence of the words that might be
present in the email. For instance, if a particular word is found to occur in spam quite often,
then it is likely that the email is spam. The algorithm is based on the simple idea that if the
probability for a word exceeds a certain limit, the filter decides which category the email
is supposed to fall into.

As it is a Bayesian algorithm, it is based on the Bayesian equation, which is as follows:

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

This equation gives the probability of occurrence of A in the presence of B. With the help of
this equation, we can find the spam rating of a particular word (token), obtained by
tokenizing the sentences:

$$S[T] = \frac{C_{spam}(T)}{C_{spam}(T) + C_{ham}(T)}$$

Here, Cspam(T) and Cham(T) are the numbers of spam and ham messages in which that
particular token occurs. For example, if the token "prize" occurs in 30 spam messages and
10 ham messages, its spam rating is 30/(30+10) = 0.75. Once the spam ratings of all the
tokenized keywords have been established, the final spam rating is calculated by adding all
of them. And this is how the Naïve Bayes method works.

Steps:

1. Reading the data and cleaning it

First things first, the data is imported with the help of the pandas library, and then the
columns and features that have negligible or no impact on the data are truncated.

CSPIT 5 U & P U. Patel Department of Computer Engineering


17CE025,17CE043,17CE126 Naïve Methods

2. Labelling the significant features, i.e. SPAM and HAM

It is also important to map the only two important labels, SPAM and HAM, to the numerical
values 1 and 0.

3. Tokenizing and feature extraction [9]

The next step in the process is to tokenize each message into individual meaningful words,
so that every word is ready to be passed to the Naïve Bayes function. This can be achieved
in multiple ways, one of which is using CountVectorizer(). Then fit_transform() is called on
the training dataset; it fits the data and decides which tokens are taken into consideration
and how they correspond to entries in the count table.

4. Splitting the data into train/test sets and applying the Naïve Bayes algorithm

The dataset has to be split into training and testing sets with the help of train_test_split().
Now the program is ready to pass the data to the Naïve Bayes function. MultinomialNB() is
a built-in sklearn estimator that does all the heavy lifting for us: the split data is passed to
it and the output is generated.

5. Pickling

The trained output is then stored in pickle files, which makes it easier to send over a
network. A condensed sketch of these steps follows.
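The following is a hedged sketch of the sklearn workflow in steps 1-5. The file name and
column layout ("v1"/"v2") are assumed from the SMS spam dataset of [7], and the 0.25
test split follows the ratio mentioned earlier; none of this is taken from the report's code.

import pickle
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]  # assumed layout
df.columns = ["label", "message"]
df["label"] = df["label"].map({"ham": 0, "spam": 1})  # step 2: HAM=0, SPAM=1

X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"], test_size=0.25, random_state=42)

vectorizer = CountVectorizer()  # step 3: tokenize and build the count table
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

model = MultinomialNB()  # step 4: Naive Bayes does the heavy lifting
model.fit(X_train_counts, y_train)
print("accuracy:", model.score(X_test_counts, y_test))

# step 5: pickle the fitted vectorizer and model so they can be sent over a network
with open("spam_model.pkl", "wb") as f:
    pickle.dump((vectorizer, model), f)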


Chapter – 3 Techniques of NN

3.1 ANN Technique

Neural networks are collections of artificial neurons, and just as information flows through
neurons in the human body, information flows through the neural network during its
learning phase.

Neural networks work on the principle of self-learning whenever data is passed to them. In
this case, when email data is passed to the network, it learns from the data and tries to
figure out whether each email is spam or ham. Weighting the ANN connections is very
important for the network to self-learn. Like any other machine learning algorithm, this one
also requires three steps: data pre-processing, followed by training and testing. Data
pre-processing removes all the rawness and impurities from the dataset and makes it
completely clean so that it can be fed to the neural network.

Thus, only relevant data and features are passed on to be processed by the neural network.
Finally, the neural network generates an email classifier with the help of the data provided
and its self-learning ability. How accurately the network learns decides the efficiency of the
neural network.

Steps:

1. Data loading, pre-processing, and labelling

The first thing to be done is to load the CSV file and prepare the raw data for training and
testing. Drop the irrelevant columns and keep only the ones that can be helpful for building
a neural network. Labelling is done, in which ham is represented by 0 and spam is
represented by 1.

2. Data preparation

Here the data is bifurcated into two equal halves for training and testing purposes.

3. Developing a predictive theory and validating it

The next step along the line is developing a logical theory that makes sense and can be
applied in the algorithm. The theory is pretty simple: the neural network has to be made to
learn certain words that are associated with spam emails only, and based on this criterion
we can classify a text as spam or ham. Now this hypothesis has to be validated


by computing the spam-ham ratios and differentiating the words mostly seen under the
SPAM label from those seen under the HAM label.

4. Conversion of text into numbers

Neural networks only understand numbers. Thus, we need to convert all the words into
corresponding numbers that can be understood and taken as input by the neural network.
The numbers corresponding to the words are deduced in such a way that a higher count
carries a higher weight in determining whether a text is spam or not.

5. Building the spam classification neural network and training it

The last and final step is to build a neural network that takes the number associated with
each word and, based on that, decides whether the email is spam or ham. Finally, it is
trained with the help of the .train() method.
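The report trains its network through a custom .train() method; as a hedged equivalent of
step 5, here is a minimal Keras formulation, where the vocabulary size, layer sizes, and the
placeholder training data are assumptions for illustration only.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 5000  # assumed size of the word-to-number vocabulary from step 4

model = keras.Sequential([
    layers.Input(shape=(vocab_size,)),      # one count per vocabulary word
    layers.Dense(16, activation="relu"),    # hidden layer whose weights self-learn
    layers.Dense(1, activation="sigmoid"),  # output: probability of spam (HAM=0, SPAM=1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# placeholder data standing in for the count vectors and 0/1 labels of steps 1-4
X_train = np.random.randint(0, 3, size=(100, vocab_size)).astype("float32")
y_train = np.random.randint(0, 2, size=(100,))
model.fit(X_train, y_train, epochs=5, batch_size=32, verbose=0)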

3.2 RNN Technique

A Recurrent Neural Network (RNN) is a kind of neural network that works on the mechanism
that "the previous step's output is fed as input to the current step". The reason for using an
RNN in a spam classifier is simple: predicting the next word of a sentence requires
memorizing the previous words. The most important feature of an RNN is the concept of
the "hidden layer": memorizing a previous state and using it in the current state can be
achieved through these hidden states only.

In this technique, the model's input sequences are texts converted into embedding vectors
of numbers corresponding to each word, and the RNN classifies based on the model's last
output, after the input passes through several hidden states whose central functions each
take the output of the previous one as input and pass their result on to the next.

Training an RNN demands high computational power, and this becomes more of a problem
as the sequence length increases. RNNs are able to learn high-level features on their own
from raw data. An RNN is like the human brain, which can process information from the
past to find the solution to the current problem.

A variant of the RNN that can remember each piece of information that is useful for
prediction, thanks to its feature of remembering previous inputs, is called "Long Short-Term
Memory" (LSTM). The main idea behind using LSTM is that a plain RNN suffers from
short-term memory, so it may leave out some important information.


Steps:

1. Import libraries

The required Keras library modules are imported so that Keras can be used in the code
without any error.

2. Load dataset

The data is loaded from a .txt file into Python variables for later use.

3. Preparing the dataset

As the NN works only on integers, we need to vectorize the text corpus by turning each text
into a sequence of integers. More precisely, fixed-length sequences of integers are
necessary for the NN computation that follows.

In this step, one has to remove punctuation, convert the text to lowercase, and then convert
the text into a sequence of numbers. There is a problem, however: not all texts convert into
sequences of the same length, so extra 0's need to be padded on to make all sequences
equal in length.

Finally, the data is split into two parts: training and testing.

4. Build the model

In this step, the first component is the "embedding layer", a pre-trained mapping of words
to N-dimensional vectors obtained with the help of "GloVe", an unsupervised learning
algorithm used for getting vector representations of words.

The second component is the "LSTM" [32] unit, which takes the output of the embedding
layer as input and passes just 2 neurons to a "softmax" activation function, corresponding
to either "spam" or "ham".

In this method, the "drop-out" is about 0.3, meaning the RNN freezes 30% of the neurons
in the previous layer on each iteration, which helps reduce overfitting of the model.

5. Train the model

Training the model also saves a log file for better visualization of the training loss, model
accuracy, and many other parameters, which can be elaborated further if required.

6. Evaluate the model

The evaluation of this model is provided in the result section for this method.

7. Customize input to the model

Now the user can check the model by sending any text they want to test, to see whether it
is "spam" or not. A sketch of the model-building steps is given below.
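The following is a hedged sketch of steps 3-5 with Keras. The sequence length, vocabulary
size, and sample texts are assumptions, and the embedding here is randomly initialised
instead of using the pre-trained GloVe vectors described in step 4.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["win a bumper prize now", "are we meeting today"]  # placeholder data
labels = np.array([1, 0])                                   # 1 = spam, 0 = ham

tokenizer = Tokenizer(num_words=5000, lower=True)  # step 3: text -> integer sequences
tokenizer.fit_on_texts(texts)
seqs = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=50)  # pad with 0's

model = keras.Sequential([
    layers.Embedding(input_dim=5000, output_dim=100),  # embedding layer
    layers.LSTM(64, dropout=0.3),                      # LSTM with 30% dropout
    layers.Dense(2, activation="softmax"),             # 2 neurons: spam / ham
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(seqs, labels, epochs=3, verbose=0)  # step 5: training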


Chapter – 4 LR and SVM

4.1 LR-SVM

Of the types of Logistic Regression, "Binary Logistic Regression" is used here, since there
are exactly two categorical outcomes, "spam" or "ham". Logistic Regression is a type of
"predictive analysis" that is conducted when the dependent variable is binary; it is used to
describe the data and the relationship between one dependent variable and the independent
variables. The same datasets are also used to train the SVM to classify any new data that it
receives.

Based on SVM, various schemes have been proposed through text classification (TC)
approaches. A central issue when using SVM is the choice of kernel, as it directly affects the
separation of emails in the feature space. The majority of kernels used in recent studies
involve continuous data and neglect the structure of the text. In contrast to classical kernels,
researchers have proposed the use of various string kernels for spam filtering and have
shown how effectively string kernels suit the spam filtering problem. On the other hand,
data preprocessing is a crucial part of text classification, where the objective is to generate
feature vectors that are usable by SVM kernels; feature mapping variants in TC have been
detailed that yield improved performance for the standard SVM in the filtering task.

The main objective of an SVM (Support Vector Machine) is to find the hyperplane in
N-dimensional space that classifies the data points most distinctly into the fixed categories.
A data point is assigned by the hyperplane with the maximum margin, i.e. the maximum
distance between the nearest data points of the two classes, which is called margin
maximization.

SVM has been broadly utilized in email spam detection; still, managing tremendous amounts
of data with it is time- and memory-consuming and can yield low precision. One
investigation accelerates the computational time of SVM classifiers by reducing the number
of support vectors. This is done by the K-means SVM (KSVM) proposed as part of that work.
Furthermore, the paper also proposes a system for email spam identification based on a
hybrid of SVM and K-means clustering, which requires one more input parameter to be
determined: the number of clusters. The analysis of the proposed mechanism was carried
out on the standard Spambase dataset, used to assess the plausibility of the proposed
technique. SVM is a splendid solution for the small-sample-size issue, constructing a
separating hyperplane to complete the classification.


Steps:

1. Importing libraries

The sklearn modules "linear_model" and "feature_extraction.text" need to be installed and
imported for the implementation of LR and SVM.

2. Visualization of the dataset (optional)

Counting the digits, question marks, exclamation marks, and upper-case letters in spam
messages simply provides a visualization of the dataset; putting all of these together in one
table helps with understanding. As this step is optional, one can jump directly to the next
step without passing through it.

3. Preprocessing

Punctuation marks are removed from each message, the text is converted to lowercase, and
stopwords are removed. The text is converted into a matrix of token counts using
CountVectorizer(), a library function of "feature_extraction.text". Then the transform()
function returns the "term-document matrix" over the vocabulary dictionary learned and
created by the CountVectorizer() function.

4. Logistic Regression

One has to split the data into training and testing sets with the help of either the "test_size"
or the "train_size" parameter. One interesting parameter of splitting data into train and test
sets is "random_state". In many professional scikit-learn examples it is set to 42, which
makes the same sequence of random numbers be generated every time the code is run, so
the split is reproducible. If this parameter is left unset, the random number generator
decides the split of the data into the two portions differently on each run. A condensed
sketch of these steps is given after the evaluation step below.

5. Evaluate model

The evaluation of this model is provided in the result section for this method.
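The following is a hedged sketch of steps 1-5. The file name and column layout are assumed
from the SMS spam dataset of [7], and LinearSVC stands in for the SVM since the report
does not name a specific kernel.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]  # assumed layout
df.columns = ["label", "message"]

# random_state fixed to 42 so the split is reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"], test_size=0.25, random_state=42)

vec = CountVectorizer(lowercase=True, stop_words="english")  # step 3 preprocessing
Xtr = vec.fit_transform(X_train)
Xte = vec.transform(X_test)

for model in (LogisticRegression(max_iter=1000), LinearSVC()):  # step 4: LR and SVM
    model.fit(Xtr, y_train)
    print(type(model).__name__, "accuracy:", model.score(Xte, y_test))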

4.2 GUI

We created a GUI using Python Django so that any user can easily interact with our system.
In this GUI, the RNN technique is used in the backend.
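As a hypothetical illustration only, a Django view along the following lines could connect
the form to the trained RNN; the view name, saved-file names, and template are illustrative
assumptions and are not taken from the report's code.

import pickle
from django.shortcuts import render
from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import pad_sequences

model = keras.models.load_model("spam_rnn.h5")  # assumed saved RNN model
with open("tokenizer.pkl", "rb") as f:          # tokenizer fitted during training
    tokenizer = pickle.load(f)

def classify(request):
    message = request.POST.get("message", "")  # text typed into the input form
    seq = pad_sequences(tokenizer.texts_to_sequences([message]), maxlen=50)
    spam_prob = float(model.predict(seq)[0][1])  # softmax output: [ham, spam]
    return render(request, "result.html", {"spam_prob": spam_prob})

Some screenshots of the GUI are attached below: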


Fig 4.1. Home Page

Fig 4.2. Input Message


Fig 4.3. Result of Input

Fig 4.4. Hovering over Result Circle


Chapter – 5 Result Analysis

5.1 Result

TF-IDF:

From the table below, one can easily see that TF-IDF is better not only in accuracy but also
in precision, recall, and F-score. At the same time, "Bag of Words" remains valuable, as the
difference in accuracy and precision is not that big, so both are useful and quite accurate.

Table 5.1 TF-IDF Result Table

Method          Precision   Recall   F-score   Accuracy
Bag of Words    ~85%        ~64%     ~73%      ~94%
TF-IDF          ~87%        ~77%     ~82%      ~96%

MultinomialNB() function:

From the result table, we can clearly see that accuracy improves to nearly 98%, which shows
the strength of MultinomialNB() and is the reason why Naïve Bayes is so useful in spam
classification, whether for SMS spam or email spam.

Table 5.2 MultinomialNB() Result Table

Method          Precision   Recall   F-score   Accuracy
MultinomialNB   ~93%        ~92%     ~92%      ~98%

ANN Technique:

From the result table, the Artificial Neural Network is so strong that accuracy approaches
99% and precision is almost 99%. It clearly shows that library functions alone are not
sufficient for the best results; training a model yourself and then using it enhances the
accuracy.

Table 5.3 ANN Result Table

Method   Precision   Recall   F-score   Accuracy
ANN      ~99%        ~90%     ~95%      ~99%


RNN Technique:

Table 5.4 RNN Result Table

Method   Precision   Recall   F-score   Accuracy
RNN      ~99%        ~98%     ~99%      ~98%

LR-SVM Technique:

Table 5.5 LR-SVM Result Table

Method    Precision   Recall   F-score   Accuracy
LR-SVM    ~97%        ~94%     ~95%      ~98%

5.2 Analysis

Table 5.6 Analysis Table

Method          Strong Point                           Weak Point
TF-IDF          Accuracy can be improved               Recall is very low; the use of n-grams may complicate the execution of the code
MultinomialNB   Simple and effective                   Not used for real-life classification
ANN             Works better in real-time scenarios    Recall is not up to the mark compared to accuracy and precision
RNN             Best result achieved                   A model must be trained, so ML concepts need to be understood
LR-SVM          Easy implementation                    Accuracy cannot be improved


Conclusion

The work completed during the summer internship program, held as a virtual internship at
CHARUSAT, has prepared us well as future employees. The internship experience was quite
good, bringing a new kind of experience and new tools and technologies that will help us in
the future. We also gained soft skills: how to communicate with different colleagues; to be
punctual, humane, attentive, and disciplined; how to be part of a team; and, if someone on
the team is not good enough, how to help them and make up for any individual shortfall,
showing how a team works in unity and is always ready to help others. Learning Machine
Learning was very important for our future; the one project we developed was full of
exceptions and errors, which taught us how to work through them. As this internship helped
greatly in our overall growth and can build our careers in a very bright manner, we will
always be thankful to the college, the faculty, and the members who gave us this
opportunity.


References

Web References:

[1] W.A. Awad, S.M. ELseuofi, “Machine Learning Methods for Spam Email Classification”, pp.1-12, 2011.

[2] https://www.researchgate.net/publication/328907962_A_Comparative_Study_of_Spam_SMS_Detection_Using_Machine_Learning_Classifiers

[3] Juan Ramos, “Using TF-IDF to Determine Word Relevance in Document Queries”, pp. 1-4

[4] https://dblp.org/rec/journals/corr/abs-1806-06407

[5] https://www.researchgate.net/publication/322791510_Spam_Detection_on_Social_Media_Text

[6] C. Kim, K. B. Hwang, “Naive Bayes classifier learning with feature selection for spam detection in
social bookmarking”, In Proceedings of European Conference on Machine Learning and Principles and
Practice of Knowledge Discovery in Databases (ECML/ PKDD), US, pp.32, 2008.

[7] https://www.kaggle.com/uciml/sms-spam-collection-dataset

[8] http://machinelearning.wustl.edu/mlpapers/paper_files/icml2003_RennieSTK03.pdf

[9] https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html

[10] https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

[11] https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html

[12] https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

[13] https://towardsdatascience.com/multinomial-naive-bayes-classifier-for-text-analysis-python-8dd6825ece67

[14] https://scikit-learn.org/stable/modules/naive_bayes.html

[15] http://oaji.net/articles/2017/1992-1514213538.pdf

[16] http://www.mecs-press.net/ijcnis/ijcnis-v10-n1/IJCNIS-V10-N1-7.pdf

[17] https://library.ndsu.edu/ir/bitstream/handle/10365/25492/Email%20Classification%20Using%20a%20Self-Learning%20Technique%20Based%20on%20User%20Preferences.pdf?sequence=1

[18] https://towardsdatascience.com/introduction-to-artificial-neural-networks-ann-1aea15775ef9

[19] https://www.simplilearn.com/what-is-perceptron-tutorial

[20] https://ieeexplore.ieee.org/abstract/document/8509069/

[21] https://www.researchgate.net/publication/324526325_Text_classification_using_artificial_neural_networks

[22] https://www.geeksforgeeks.org/introduction-to-recurrent-neural-network/


[23] https://www.researchgate.net/publication/328759146_Spam_filtering_in_SMS_using_recurrent_neural_networks

[24] Su, Bolan and Lu, Shijian. (2017) Accurate recognition of words in scenes without character segmentation
using recurrent neural network, Pattern Recognition, Elsevier, Volume 63, Pages 397-405

[25] Venugopalan, Subhashini and Xu, Huijuan and Donahue, Jeff and Rohrbach, Marcus and Mooney,
Raymond and Saenko, Kate. (2014) Translating videos to natural language using deep recurrent neural
networks, arXiv preprint arXiv:1412.4729

[26] Guo, Liang and Li, Naipeng and Jia, Feng and Lei, Yaguo and Lin, Jing. (2017) A recurrent neural network
based health indicator for remaining useful life prediction of bearings, Neurocomputing, Elsevier, Volume
240, Pages 98-109

[27] Chen, Yu and Yang, Jian and Qian, Jianjun. (2017) Recurrent neural network for facial landmark detection,
Neurocomputing, Elsevier, Volume 219, Pages 26-38

[28] http://ijarcsse.com/Before_August_2017/docs/papers/Volume_6/10_October2016/V6I10-0126.pdf

[29] Sutskever, Ilya, James Martens, and Geoffrey E. Hinton. "Generating text with recurrent neural networks."
Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.

[30] https://www.researchgate.net/publication/338071063_Spam_Review_Detection_Using_Deep_Learning

[31] https://www.ijert.org/research/spam-detection-using-knn-back-propagation-and-recurrent-neural-network-IJERTV4IS090492.pdf

[32] https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e

[33] https://machinelearningmastery.com/logistic-regression-for-machine-learning/

[34] https://link.springer.com/article/10.1007/s10462-010-9166-x

[35] http://www.ijitee.org/wp-content/uploads/papers/v9i2/B9001129219.pdf

[36] http://ijraset.com/fileserve.php?FID=17693

[37] https://www.ijariit.com/manuscripts/v3i3/V3I3-1608.pdf

[38] https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

[39] https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
