Irjet V7i7318
Abstract - Due to COVID-19, internet usage has grown rapidly for reading news, communicating with friends, playing games, working from home, general surfing and so on. Hence, there is a need to understand users' web browsing behavior from these activities, so that their browsing experience can be improved. In this paper, we present various machine learning (ML) algorithms to predict and analyze current user behavior. The main objective of this work is to discriminate and classify the group of content in which a user is most interested. Data on users' surfing events is collected using a browsing history application. Classification helps us identify similar groups of browsed data based on parameters such as most-visited sites, time duration and type of website browsed. Algorithms such as k-Nearest Neighbor (KNN), Naive Bayes (NB), Support Vector Machine (SVM) and K-means clustering are compared. From the results, it is observed that Naive Bayes predicted with a good accuracy of about 90 ± 4%.

Key Words: Machine learning; K-Nearest neighbor; Naive Bayes; Support vector machine; User Behavior analysis.

1. INTRODUCTION

Web browsing is defined as searching for useful information on the web and displaying it to users. There are different ways to obtain the desired information from server logs. Web browsing is the process of extracting useful information or particular data from a server. In this process, the browser finds out what users are looking for and are most interested in searching for on the Internet. From the survey, we found that most authors have carried out their work using machine learning techniques, because these algorithms gave good accuracy compared with other techniques such as artificial neural networks and deep learning algorithms.

In [1], the authors implemented Machine Learning (ML) algorithms, which demonstrated their effectiveness in clustering. The work was carried out by collecting real-time datasets from a hospital. The result analysis showed that the ML algorithms are effective and suitable for clustering the data as well as for predicting user behavior. The authors of [2] proposed a generative model for their experiment, Multi-site Probabilistic Factorization (MPF). The datasets were collected from six different popular video websites. This model captures two different kinds of preference: cross-site and site-specific. The MPF model achieved an accuracy of 97 ± 2%. Understanding user behavior and access patterns is of great importance to content providers. The researchers in [3] used unique data collected from an Internet service provider (ISP) and systematically analyzed user behavior and viewing patterns across the six most popular content providers. The classification of data in [4] is done with ML algorithms such as logistic regression and a decision tree classifier. The decision tree produced an accuracy of about 86%, so the authors concluded that the decision tree is better than logistic regression. The authors in [5] experimented on web logs together with data collected from individual users of a website, and discussed user behavior using a Long Short-Term Memory (LSTM) model. Parallel FP-Growth (PFP), large page sets based parallel FP-Growth (LPS-PFP) and most interesting pattern-based parallel FP-Growth (MIP-PFP) were compared in [7], and MIP-PFP outperformed the other algorithms. The authors in [8] carried out an experiment using Support Vector Machines (SVM); the prediction performance obtained was better than that of Back Propagation Networks (BPN), with an average prediction accuracy of about 80%.

Researchers face several issues in this kind of work. The amount of data required must be very large for ML algorithms, so that the results of preprocessing and prediction are more accurate. Preprocessing poses challenges such as handling large data, which sometimes cannot fit into memory. Several potential research challenges are faced in working with ML/DL [1].

In this paper, we address the above issues by collecting real-time datasets. For this, we made an experimental setup with 20 computers in a lab and requested all the students to browse for an hour without any monitoring. Later, these real-time datasets were collected from the browsing history of various websites. Figure 1 represents some of the social networks which users are interested in browsing. These datasets are given to the developed machine learning algorithms. The ML algorithms chosen are K-NN, Naive Bayes classifier and
© 2020, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 1740
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 07 | July 2020 www.irjet.net p-ISSN: 2395-0072
Support vector machine (SVM). In order to analyze and predict future usage, the prediction of user behavior is done using a machine learning (ML) approach, in which a set of features is extracted from the datasets and then the prediction model is developed. The model is trained and tested using an 80% training and 20% testing approach. These algorithms were the most suitable and gave the best results in the survey. User Behavior Analysis (UBA) and prediction are carried out by invoking the developed model in Python. The results are compared across all three algorithms.

Fig 1: Different social media network

2. USER BEHAVIOR ANALYSIS AND PREDICTION

For many years, user analysis has focused on marketing applications, such as the buying intentions of online buyers. The objective of such work is to adopt efficient, and in some cases new, specific marketing strategies. These strategies are based on real-time datasets, that is, information recorded by the systems, which includes the past activities of users or clients. This can therefore be defined as data-based behavioral analysis, because the analysis is done on recorded data. This analysis has proved important in detecting and fighting fraud. It is therefore not surprising that behavior analysis can enhance information and communication technology, detect internal threats such as targeted attacks, accelerate repetitive tasks, adapt software to its users, and so organize production tools more efficiently.

The user model is a representation of a single user or of a group of multiple users in a system. The developed model includes a set of data/parameters that are representative of the user's previous behavior. The development of a user model starts with designing a system that collects all the information needed to represent the users. The real-time data obtained from a browsing tool can be used to deeply understand the behavior of a user [4]. The model is developed based on certain features and parameters which describe user behavior, i.e. where the user is most interested in surfing. This browsed information can be obtained through various applications from the web. From the results of the developed model, we can know the user's interest in advance, and it then becomes easy to provide personalized services. If any information is missing, it can be easily retrieved, and future activities and behaviors can be predicted.

Behavioral Analysis

User Behavior Analysis (UBA) is the discipline of analyzing the behavior of a user. Operationally, it can be defined as collecting data, monitoring the obtained data and processing that data for analysis. The required datasets are collected from the users as browsing history, stored in separate files, databases, directories or data log files. The purpose of this data collection is to provide the desired parameters, from which it is easy to build usable and reliable models of the user. In other words, it will precisely classify the user group and accurately characterize the users. For example, Internet surfing has nowadays become a privileged space for this type of application. Indeed, technologies are now so advanced in every aspect that data can be collected and then used to exploit the present, past and future behavior of individual users. The three pillars of UBA are: analysis of data, integration of data and representation of data. The most difficult challenge is analyzing and processing the huge amount of data. UBA must be fast in preprocessing the huge data of the users, and the selected ML algorithms should be appropriate for classifying the users. Therefore, the machine learning algorithms must run in real time, so that they have easy access to the complete datasets.

3. DATASETS COLLECTION

The dataset collection is done by anonymized viewing of the websites visited by particular users. From the search engine, we can know what has been browsed previously. There are many browsing history tools which collect browsed data automatically, as shown in Table 1. Some of them are open source and some are licensed tools. Different tools have different features for collecting data. We found the Browsing History View tool to be suitable for this work.

Table -1: Various Browsing tools

Sl no. | Tool | Availability | Advantages | Disadvantages
1 | Time your web [10] | Free source | Best for Google Chrome; time bar graph | Does not support other websites
2 | Activity watch [10] | Free source | Available for Mozilla extensions | Not good for Google Chrome and other websites
3 | Rescue Time [10] | Not free source | Keeps complete track of all online activities | It is not free source
4 | Browsing History View [9] | Free source | Reads the history data from 4 different web browsers such as Internet Explorer, Firefox, Google Chrome and Safari; requires less memory; displays history | Sometimes bloats the 'Visit Count' number, only for Internet Explorer

Each log of the browsed data contains user information such as user identity (ID), Uniform Resource Locator (URL), Title, Visit Time, Visit Count, Web Browser, URL Length, Typed Count, History File, Duration and Record Id. By crawling and parsing the type of data that has been browsed, we obtained information regarding the URL, title, typed count and viewed website from the respective websites and content providers. Specifically, we have classified browsing into 5 different types, namely social media, education, shopping, news and entertainment, plus other purposes. The different content providers on the web have their own naming conventions for titles and other parameters. For example, in some of the logs we observed that the content provider's name is embedded at the beginning of the titles of the browsed websites. Based on these naming conventions, manual modification is done to separate titles and websites. Separating these parameters avoids mixing them, which later allows the data to be classified accurately and effectively. Figure 2 is a captured view of how the data is collected. The data obtained is saved in an Excel or comma-separated values (CSV) file, because these files are easy to load later in the program.

The browsed data is collected at various times of day, i.e. morning, afternoon and night, because browsing data varies from time to time. For example, some users are interested in news sites in the morning, some users might browse study-related sites, and at night users might browse social media networks. The browsing data also varies from user to user. So, for our work we have taken browsed data from morning to evening.

Fig 2: Data Collection View

4. METHODOLOGY

Figure 3 shows the complete process involved in classifying and analysing the behaviour of the user.

Fig 3. Process of classifying and analyzing UBA

The general steps for predicting user behavior based on web browsing history are:

Step 1: Arrange and set up about 5-10 computers in the department. These should have the same configuration, such as hard disk storage (RAM, ROM), power and device speed.

Step 2: Download and install a web browsing history application, i.e. Browsing History View, on each of the computers.

Step 3: Users are allowed to browse freely according to their own interests for a certain time duration. The browsed data is then collected from the application installed on the system.

Step 4: The collected browsed data is given to the machine learning algorithms. The algorithms classify the data based on the similarity of what was browsed, and similar records are grouped and classified.

Step 5: Thus, all features are classified, and then we can predict the user behavior and compare the ML algorithms.

The problem of predicting user behavior can be formulated in ML terminology in a straightforward way [6]. Here, the problem is formulated as a classification task. The main goal is to predict the outcome for a user. To handle new sets of `test' samples, one must build a predictive model M, and the accuracy of that model is then obtained. Several classifier models were developed for our work, and they gave satisfactory results.

(i) K nearest Neighbor (k-NN)

The main idea of the k-NN technique is to choose k neighboring vectors for each input vector. Let x be an input vector. The neighbors of an input vector are selected using a distance metric between the data points. Here, the Euclidean distance is used to find the minimum distance. The next step is to compute the similarity measure and compare all the vectors; from the results of the similarity measure we obtain the k nearest neighbors. To calculate the distance in Euclidean space, the similarity measure S(p, q) between two vectors p and q is considered. Equation 1 shows the Euclidean distance:

$$S(p, q) = \sqrt{\sum_{i=1}^{n} w_i \,(p_i - q_i)^2} \qquad (1)$$

where p and q are the two vectors, $p_i$ and $q_i$ are their i-th entries, n is the total number of entries, and $w_i$ are the weights, which correspond to the importance of each entry in the prediction. Each vector p is a point in n-dimensional Euclidean space. The variable Y is the variable to be predicted, and it is determined by the majority group among the k nearest neighbors of this n-dimensional point. For example, suppose k = 5. If three of the five nearest neighbors belong to the class labeled `crosses', or `1', while the two others belong to the class labeled `circles', or `0', then Y will be equal to one, because the class `crosses' has the majority of neighbors. Figure 4 shows classification using the k-NN approach.

(ii) Naive Bayes theorem

Bayes' theorem describes the probability of an event based on prior information related to that event [7]. For instance, network traffic information related to an attack can be known from DoS attack information. Therefore, by comparing with the network traffic, the probability that the traffic is related to an attack can be evaluated using Bayes' theorem, even without knowing the past network traffic information. A common and efficient Machine Learning (ML) algorithm based on probability calculation is the Naive Bayes (NB) classifier. The NB classifier performs classification by estimating probabilities from the dataset, and it is a well-known and widely used supervised classifier. It calculates the posterior probability for the given data and uses Bayes' theorem to forecast the probability that the feature set of an unlabeled example fits a specific label. Taking intrusion detection as an example, NB is used to classify traffic as abnormal or normal. The advantages of NB classifiers include ease of implementation, simplicity, applicability to binary and multi-class classification, robustness to irrelevant features and the low amount of training required.

As an example, to classify a test instance X, a probabilistic model can be formulated by calculating the posterior probability $P(C \mid X)$ for the different classes $C$; the class with the largest posterior probability for the given data is then predicted. From the maximum a posteriori (MAP) rule, Bayes' theorem gives equation 2:

$$P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)} \qquad (2)$$

where $P(C)$ is estimated by counting the proportion of class data points in the
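The MAP rule described for the Naive Bayes classifier can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names (time of day, browser), the toy records and the helper names `train_nb` and `predict_nb` are all hypothetical, and add-one (Laplace) smoothing is an added assumption to avoid zero probabilities. The denominator P(X) is omitted because it is the same for every class and does not change which class is largest.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[c][i][v] = number of class-c rows whose i-th feature equals v
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    feature_values = defaultdict(set)
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[c][i][v] += 1
            feature_values[i].add(v)
    return class_counts, feature_counts, feature_values


def predict_nb(model, row):
    """Return the class with the largest log P(C) + sum_i log P(x_i | C)."""
    class_counts, feature_counts, feature_values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # prior: proportion of class-c data points
        for i, v in enumerate(row):
            # Add-one smoothing (assumption): unseen values get a small probability.
            numerator = feature_counts[c][i][v] + 1
            denominator = n_c + len(feature_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy records: (time of day, browser) -> browsing category.
rows = [
    ("morning", "chrome"), ("morning", "firefox"),
    ("night", "chrome"), ("night", "chrome"),
    ("afternoon", "firefox"), ("night", "firefox"),
]
labels = ["news", "news", "social", "social", "education", "social"]

model = train_nb(rows, labels)
print(predict_nb(model, ("night", "chrome")))
```

Working in log space avoids numerical underflow when many features are multiplied together; with real browsing logs, the features would come from columns such as Visit Time or Web Browser.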
A. K-means clustering

In this clustering method, the data samples need to be grouped into k clusters. The value of k can be estimated by the WCSS method. This estimate is calculated using the Euclidean distance of the samples. Figure 6 shows the graphical relationship between the within-cluster sum of squares (WCSS) and the number of clusters. The number of clusters is then selected where the change in WCSS begins to level off (elbow method). This is the resultant graph obtained for our datasets.

From the graph and result, we can say that the user is most interested in browsing data related to entertainment sites, because the number of visits and the time spent are higher than for the other clusters.

B. Naive Bayes Classifier

In the NB classifier algorithm, the datasets are trained and tested at different train-test proportions. The results of the developed NB classifier are tabulated in Table 2. These are the accuracy results for the various training set approaches. The classifier achieved its best accuracy, about 93.33%, with the 75-25% training approach.

C. KNN

The second algorithm chosen for classification is the kNN algorithm, and the same method is applied for training and testing. The accuracy results obtained are tabulated in Table 2.

This algorithm gave its best result, about 58% accuracy, with the 80-20% training approach.

D. SVM

The last algorithm chosen from the survey is SVM. The same methodology is followed for this algorithm. The results obtained are tabulated in Table 2.

Fig 8 : Accuracy calculations of various algorithms

From Figure 8, it is observed that for all the percentages of datasets, NB gives the best performance compared with the other algorithms.
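The comparison above rests on training and testing each classifier at several train-test proportions. A minimal sketch of that hold-out procedure is given below, using a plain k-NN classifier with unweighted Euclidean distance (all entries treated as equally important). The feature pair (visit count, duration), the synthetic records, the helper names and the random seeds are assumptions made for illustration; they are not the paper's dataset or code.

```python
import math
import random
from collections import Counter


def euclidean(p, q):
    """Unweighted Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_x, train_y, x, k=5):
    """Majority label among the k training points closest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def holdout_accuracy(xs, ys, train_fraction, k=5, seed=0):
    """Shuffle, split into train/test by train_fraction, return test accuracy."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_fraction)
    train, test = idx[:cut], idx[cut:]
    train_x = [xs[i] for i in train]
    train_y = [ys[i] for i in train]
    correct = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in test)
    return correct / len(test)


# Synthetic, well-separated records: (visit count, duration in minutes).
rng = random.Random(1)
xs, ys = [], []
for _ in range(50):
    xs.append((20 + rng.uniform(-3, 3), 40 + rng.uniform(-3, 3)))
    ys.append("entertainment")
    xs.append((5 + rng.uniform(-3, 3), 5 + rng.uniform(-3, 3)))
    ys.append("news")

for fraction in (0.75, 0.80):
    print(f"{int(fraction * 100)}-{100 - int(fraction * 100)} split:",
          holdout_accuracy(xs, ys, fraction))
```

Real browsing logs would not be this cleanly separated, so accuracies on them would vary with the split, as reported in the results. The same `holdout_accuracy` loop could wrap any classifier that exposes a comparable predict function.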
6. CONCLUSIONS

In this work, user browsing data is obtained from a browsing history viewing tool. We model the browsed data with reference to social networks, entertainment platforms and other categories. Through real data analysis, we observe where a user is most interested. We developed several ML algorithms to classify the data. The proposed models are used to analyze and predict the UB from the dataset, and the accuracy of each algorithm is then calculated. The algorithms chosen here are KNN, NB classifier and SVM, and K-means is used for clustering the data.

Among these algorithms, we got good results for the NB classifier, which gave an accuracy of about 93%. From this we can conclude that the Naive Bayes classifier is the best among the algorithms developed, and we obtained satisfactory results for this work.

[7] Liu, Qingchao & Lu, Jian & Chen, Shuyan & Zhao, Kangjia. (2014). Multiple Naïve Bayes Classifiers Ensemble for Traffic Incident Detection. Mathematical Problems in Engineering. 2014. 1-16. 10.1155/2014/383671.

[8] N. K. Gyamfi and J. Abdulai, "Bank Fraud Detection Using Support Vector Machine," 2018 IEEE 9th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, 2018, pp. 37-41, doi: 10.1109/IEMCON.2018.8614994.

[9] https://en.wikipedia.org/wiki/Web_browsing_history.

[10] https://hetmanrecovery.com/web-browser-history-viewer-software.htm.