International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 07 Issue: 07 | July 2020 www.irjet.net p-ISSN: 2395-0072

Predicting the User Behavior Analysis using Machine Learning Algorithms
Ashwini1, K Viswavardhan Reddy2
1Digital Communication Engineering Dept. Telecommunication, RV College of Engineering, Bengaluru, India
2Associate Professor, Dept. Telecommunication, RV College of Engineering, Bengaluru, India

---------------------------------------------------------------------***----------------------------------------------------------------------
Abstract - Due to COVID-19, there is rapid growth in the use of the internet for reading news, communicating with friends, playing games, working from home, surfing, etc. Hence, there is a need to understand users' web browsing behavior from the activities they carry out on the web, so that their browsing experience can be improved. In this paper, we present various machine learning (ML) algorithms to predict and analyze current user behavior. The main objective of this work is to discriminate and classify the group of sites in which a user is most interested. The events related to a user's surfing are collected using a browsing history application. Classification helps us to identify similar groups of browsed data based on parameters such as most visited sites, time duration, and the type of website that the person has browsed. The k-nearest neighbor (KNN), Naive Bayes (NB), Support Vector Machine (SVM), and k-means clustering algorithms have been compared. From the results, it is observed that the Naive Bayes classifier predicted with a good accuracy of about 90±4%.

Key Words: Machine learning; K-Nearest neighbor; Naive Bayes; Support vector machine; User Behavior analysis.

1. INTRODUCTION

Web browsing is defined as searching for useful information on the web and displaying it to users. There are different ways to obtain the desired information from server logs. Web browsing is the process of extracting useful information or particular data from a server. In this process, the browser finds out what users are looking for and what they are most interested in searching for on the Internet. From the survey, we found that most authors have carried out their work using machine learning techniques, because these algorithms gave good accuracy when compared to other techniques such as artificial neural networks and deep learning algorithms.

In [1], the authors implemented machine learning (ML) algorithms, which demonstrated their effectiveness in clustering. The work was carried out by collecting real-time datasets from a hospital. The result analysis showed that the ML algorithms are effective and suitable for clustering the data as well as for predicting user behavior. The authors of [2] proposed a generative model, Multi-site Probabilistic Factorization (MPF), for their experiment. The datasets were collected from six popular video websites. This model captured two kinds of preferences, cross-site and site-specific. The MPF model achieved an accuracy of 97±2%. Understanding user behavior and access patterns is of great importance to content providers. The researchers in [3] carried out their work using unique data collected from an Internet service provider (ISP), and they systematically analyzed the user behavior and viewing patterns across the six most popular content providers. The classification of data in [4] is done with ML algorithms such as logistic regression and a decision tree classifier. The decision tree produced an accuracy of about 86%, so from the results they concluded that the decision tree is better than logistic regression. The authors in [5] experimented on web logs together with data collected from individual users of a website, and discussed user behavior using a long short-term memory (LSTM) model. Parallel FP-Growth (PFP), large page sets based parallel FP-Growth (LPS-PFP), and most interesting pattern-based parallel FP-Growth (MIP-PFP) were compared in [7], and MIP-PFP outperformed the other algorithms. The authors in [8] carried out an experiment using support vector machines (SVM); the prediction performance obtained was better than that of Back Propagation Networks (BPN), with an average prediction accuracy of about 80%.

Researchers face several issues in this kind of work. The amount of data required must be very large for ML algorithms, so that the results of preprocessing and prediction are more accurate. Preprocessing includes challenges such as handling large data, which sometimes cannot fit into memory. Several potential research challenges in working with ML/DL are discussed in [1].

In this paper, we address the above issues by collecting real-time datasets. For this, we made an experimental setup with 20 computers in a lab and requested all the students to browse for an hour without any monitoring. Later, these real-time datasets were collected from the browsing history of various websites. Figure 1 represents some of the social networks which users are interested in browsing. These datasets are given to the developed machine learning algorithms. The ML algorithms chosen are K-NN, Naive Bayes classifier and

© 2020, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 1740
Support vector machine (SVM). In order to analyze and predict future usage, the prediction of user behavior is done using a machine learning (ML) approach, in which a set of features is extracted from the datasets and then the prediction model is developed. The model is trained and tested with an 80% training and 20% testing approach. These algorithms were the most suitable and gave the best results in the survey. The User Behavior Analysis (UBA) and prediction are carried out by invoking the developed model in Python, and the results of all three algorithms are compared.

Fig 1: Different social media network

2. USER BEHAVIOR ANALYSIS AND PREDICTION

For many years, the analysis of users has focused on marketing applications, such as the buying intention of online buyers. The objective of such work is to adopt efficient and new specific marketing strategies. These strategies are based on real-time datasets, that is, recorded information from the systems, which includes the past activities of the users or clients. This can therefore be defined as data-based behavioral analysis, because the analysis is done on recorded data. This analysis has found its importance in detecting and fighting fraud. It is therefore not surprising that behavior analysis can enhance information and communication technology, detect internal threats such as targeted attacks, accelerate repetitive tasks, adapt software to its users, organize production tools more efficiently, etc.

The user model is a representation of a single user or of a group of multiple users in a system. This model includes a set of data/parameters that are representative of the user's previous behavior. The development of the user model starts with system design, which collects all the data needed for representing the users. The real-time data obtained from a browsing tool can be used to deeply understand the behavior of a user [4]. The model is developed based on certain features and parameters which describe user behavior, that is, what the user is most interested in surfing. This browsed information can be obtained through various applications from the web. From the results of the developed model we can know the user's interest in advance, which makes it easy to provide personalized services. If any information is missing, it can be easily retrieved, and future activities and behaviors can be predicted.

Behavioral Analysis
User Behavior Analysis (UBA) is the discipline of analyzing the behavior of a user. Operationally, it can be defined as collecting data, monitoring the obtained data, and processing that data for analysis. The required datasets are collected from the users as the history of browsed data, which is stored in separate files, databases, directories, data log files, etc. The purpose of collecting these datasets is to provide the desired parameters, and from this data it is easy to build usable and reliable models of the user. In other words, the model will precisely classify the user group and accurately characterize the users. For example, Internet surfing has nowadays become a privileged space for this type of application. Indeed, technologies are now developed in every aspect needed to collect data and then exploit the present, past and future behavior of individual users. The three pillars of UBA are: analysis of data, integration of data and representation of data. The most difficult challenge is analyzing and processing the huge amount of data. UBA must be fast in preprocessing the huge data of the users, and the selected ML algorithms should be appropriate for classifying the users. Therefore, the machine learning algorithms must run in real time, so that they can easily access the complete datasets.

3. DATASETS COLLECTION

The dataset is collected from anonymized records of the websites viewed by the users. From the browser we can know what has been browsed previously. There are many browsing history tools which collect browsed data automatically, as shown in Table 1. Among those, some are open source and some are licensed tools. Different tools have different features for collecting data. We found the tool named Browsing History View to be suitable for this work.

Table -1: Various Browsing tools

Sl no. | Tool | Availability | Advantages | Disadvantages
1 | Time your web [10] | Free source | Best for Google Chrome; time bar graph | Does not support other websites
2 | Activity watch [10] | Free source | Available for Mozilla extensions | Not good for Google Chrome and other websites
3 | Rescue Time [10] | Not free source | Keeps complete track of all online activities | It is not free source
4 | Browsing History View [9] | Free source | Reads the history data from 4 different web browsers (Internet Explorer, Firefox, Google Chrome, Safari); requires less memory; displays history | Sometimes it bloats the 'Visit Count' number, only for Internet Explorer

Each log of the browsed data contains user information such as user identity (ID), Uniform Resource Locator (URL), Title, Visit Time, Visit Count, Web Browser, URL Length, Typed Count, History File, Duration, and Record Id. By crawling and parsing the type of data that has been browsed, we obtained information regarding the URL, title, typed count and viewed website from the respective websites and content providers. Specifically, we have classified browsing into 5 different types: social media browsing, educational purposes, shopping, news, and entertainment, with the remainder grouped as other purposes. The different content providers on the web have their own naming conventions for titles and other parameters. For example, in some of the logs we observed that the content provider's name is embedded at the beginning of the titles of the browsed websites. Based on these naming conventions, the data is manually modified to separate titles from websites. By separating those parameters, mixing of parameters is avoided, which later allows the data to be classified accurately and effectively. Figure 2 is a captured view of how the data is collected. The data obtained is saved in an Excel or comma-separated values (CSV) file, because these files are easy to load later in programming.

The browsed data is collected at various times, i.e. morning, afternoon and night, because browsing data varies from time to time. For example, some users are interested in news sites in the morning, some users might browse study-related sites, and at night users might browse social media networks. The browsing data also varies from user to user. So, for our work we have taken browsed data from morning to evening.

Fig 2: Data Collection View

4. METHODOLOGY
Figure 3 shows the complete process involved in classifying and analysing the behaviour of the user.

Fig 3. Process of classifying and analyzing UBA

The general steps for predicting user behavior based on web browsing history are:

Step 1: Arrange and set up about 5-10 computers in the department. These should have the same configuration, such as storage and memory (hard disk, RAM, ROM), power, and speed of the device.

Step 2: Download and install a web browsing history application, i.e. Browsing History View, on each of the computers.

Step 3: Users are allowed to browse freely according to their own interest for a certain time duration. Then the browsed data is collected from the application installed on the system.

Step 4: The collected browsed data is given to the machine learning algorithms. The algorithms classify the data by grouping similar browsed data together.


Step 5: Thus, all features are classified, and then we can predict the user behavior and compare the ML algorithms.

A. Machine Learning Algorithms for UBA

Many ML approaches have shown promising results in predicting user behavior on websites. From a high-level model of the information, one can set the direction of modeling for predicting user behavior. High-level statistics can be built on how exactly the data propagates over time. Here, the visit time (i.e. the duration spent) and the number of visits made became useful measures for understanding UBA on the web. By also taking various other features from the obtained data, we get a more detailed view of user behavior [6]. The feature-based approach to developing a model provides a more precise model with good accuracy.

In ML terminology, the problem of predicting user behavior can be formulated in a straightforward way [6], as a classification task. The main goal is to predict the outcome for a user. A predictive model M must be built and then applied to new sets of `test' samples, from which we obtain the accuracy of the model. Some classifier models were developed for our work, and they gave satisfactory results.

(i) K-nearest neighbor (k-NN)
The main idea of the k-NN technique is to choose k neighboring vectors for each input vector. Let x be an input vector. Its neighboring vectors are selected using a distance metric between the data points; here, the Euclidean distance is used to find the minimum distance. The next step is to compute the similarity measure and compare all the vectors; from the results of the similarity measure we obtain the k nearest neighbors. For the Euclidean distance, the similarity measure S(p, q) between two vectors p and q is given by equation 1:

$$S(p, q) = \sqrt{\sum_{i=1}^{n} \omega_i \,(p_i - q_i)^2} \qquad (1)$$

where p and q are the two vectors, $p_i$ and $q_i$ are their ith entries, n is the total number of entries, and $\omega_i$ is the weight of the ith entry, which corresponds to its importance in prediction. Each vector p is a point in the n-dimensional Euclidean space. The predicted variable Y for a point is determined by the majority class among its k nearest neighbors. For example, let k = 5. If three of the five nearest neighbors belong to the class labeled `crosses' (or `1') and the other two belong to the class labeled `circles' (or `0'), then Y equals one, because the class `crosses' holds the majority of the neighbors. Figure 4 shows classification using the k-NN approach.

Fig 4: Classification using a k-NN approach

(ii) Naive Bayes
Bayes' theorem describes the probability of an event based on previous information related to that event [7]. For instance, DoS attacks are known to be related to network traffic information. Therefore, compared with assessing the network traffic without knowing the past traffic information, Bayes' theorem lets us evaluate the probability that the traffic is attack-related using that past information. A common and efficient ML algorithm based on probability calculation is the Naive Bayes (NB) classifier, a well-known and widely used supervised classifier. The NB classifier calculates the posterior probability for the given data and then uses Bayes' theorem to forecast the probability that the feature set of an unlabeled example fits a specific label. Taking intrusion detection as an example, NB is used to classify traffic as abnormal or normal. The advantages of NB classifiers include ease of implementation, simplicity, applicability to binary and multi-class classification, robustness to irrelevant features, and a low training requirement.

As an example, to classify a test instance X, a probabilistic model is formulated by calculating the posterior probability $P(w \mid X)$ for the different classes $w \in W$, and the class with the largest posterior probability is predicted. From the rule of maximum a posteriori (MAP), we have Bayes' theorem as equation 2:

$$P(w \mid X) = \frac{P(X \mid w)\,P(w)}{P(X)} \qquad (2)$$

where $P(w)$ is estimated by counting the proportion of class-w data points in the


training set. Then $P(X)$ can be ignored, because we are comparing different w on the same data point. So we need to consider $P(X \mid w)$. If we obtain an accurate estimate of $P(X \mid w)$ from the given training data, the result is the best classifier in theory: the Bayes optimal classifier, with the smallest error rate. However, estimating $P(X \mid w)$ is not straightforward, because it involves estimating an exponential number of joint probabilities of the features. To make the estimation tractable, an assumption is made: the NB classifier assumes that, given the class label, all n features are independent of one another within the same class. Thus, we have equation 3:

$$P(X \mid w) = \prod_{i=1}^{n} P(x_i \mid w) \qquad (3)$$

Now only the probability of each feature value in each class needs to be estimated, so the calculation of joint probabilities can be avoided. From the training set, the Naive Bayes classifier estimates the probabilities $P(w)$ for all classes $w \in W$ and $P(x_i \mid w)$ for all features $x_i$, i = 1, 2, 3, ..., n. A test instance is labeled with class w if w leads to the largest value, over all class labels, of

$$P(w \mid X) \propto P(w) \prod_{i=1}^{n} P(x_i \mid w)$$

The NB technique provides a very simple approach, with clear semantics, to representing, using, and learning probabilistic knowledge. The goal of the NB classifier is to accurately predict the class of test instances.

(iii) Support vector machine (SVM)
The support vector machine is a non-probabilistic classifier. The SVM model assigns labels for prediction, placing each point into one category or the other. The SVM builds a boundary between the data points in n-dimensional space, where n represents the total number of features. The boundary between any two classes can be determined only after training the SVM classifier. Here, the two classes are labeled as crosses and circles. If a new n-dimensional point lies above the boundary line, it is classified and labeled as a `cross', and as a `circle' if not [8]. From Figure 5 we see that points A, B and C are classified as `crosses', because they are above the boundary.

Fig 5: Classification using an SVM approach.

Like other ML approaches, SVM is a commonly implemented ML technique for achieving better prediction accuracy.

(iv) k-means clustering
k-means clustering is an unsupervised ML method for clustering data. It aims to discover k clusters in the input data, where k is the total number of clusters that the algorithm must generate for the input dataset. The method works by iteratively allocating each data point to one of the k clusters, according to the minimum distance between the points and based on the various features. To obtain the final clustering, these steps are repeated. The inputs of the algorithm are only the dataset and the number of clusters k. First, the k centroids are initialized, with k chosen by the within-cluster sum of squares (WCSS) or another method. Each sample in the data is then assigned to its closest centroid, using the squared Euclidean distance between the points. Second, once all the data points are assigned to a cluster, the centroids are recalculated by taking the mean of all sample values in each cluster. The algorithm iterates until no sample changes its cluster assignment. The performance and accuracy of clustering are less precise than those of supervised learning methods. However, labeled data is difficult to generate, and in such cases unsupervised algorithms are a good choice. Unsupervised ML methods have many applications in the security domain.

5. EXPERIMENTAL RESULTS

The ML algorithms have been implemented considering some of the features. First, the real-time datasets needed for our work are collected from users' personal computers on which the browser history tool is installed. The datasets are then rearranged and modified, and divided into training and testing sets (80-20% approach). The 80% training set is used to train the developed ML algorithm, and the 20% testing set is used to test the output once classification is done.
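The 80-20% split and the k-NN majority vote described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the records are synthetic, the two features (visit count and duration) and the category names are hypothetical stand-ins for the collected log fields, and the function names are assumptions introduced here.

```python
import math
import random
from collections import Counter

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and return (training rows, testing rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def euclidean(p, q):
    """Unweighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_rows, features, k=5):
    """Return the majority label among the k nearest training rows."""
    nearest = sorted(train_rows, key=lambda row: euclidean(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train_rows, test_rows, k=5):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train_rows, f, k) == label for f, label in test_rows)
    return hits / len(test_rows)

# Synthetic, hypothetical records: ((visit_count, duration_minutes), category).
rng = random.Random(1)
records = []
for _ in range(25):
    records.append(((rng.uniform(15, 25), rng.uniform(25, 35)), "social"))
    records.append(((rng.uniform(1, 6), rng.uniform(2, 8)), "news"))

train_rows, test_rows = train_test_split(records, train_fraction=0.8, seed=42)
knn_accuracy = accuracy(train_rows, test_rows, k=5)
print(len(train_rows), len(test_rows), knn_accuracy)
```

Changing `train_fraction` to 0.75, 0.70 or 0.60 gives the other splits compared in this section. The two synthetic groups are deliberately well separated, so the accuracy on this toy data says nothing about accuracy on real browsing logs.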


A. k-means clustering

In this clustering method, the data samples need to be grouped into k clusters. The value of k can be estimated by the WCSS method, which is calculated using the Euclidean distance of the samples. Figure 6 shows the graphical relationship between the within-cluster sum of squares (WCSS) and the number of clusters. The number of clusters is selected where the change in WCSS begins to level off (elbow method). This is the resulting graph obtained for our datasets.

Fig 6: WCSS graph for finding K

From the graph, the estimated value is k = 5, so the number of clusters is 5. The clustered result is shown in Figure 7.

Fig 7: k-Clustered results

From the graph and the result, we can say that the user is most interested in browsing data related to entertainment sites, because the number of visits and the duration of time spent are higher than in the other clusters.

B. Naive Bayes Classifier
For the NB classifier, the datasets are trained and tested at different training/testing ratios. The results of the developed NB classifier are tabulated in Table 2. These are the accuracy results for the various training set approaches. The classifier achieved good accuracy, about 93.33%, with the 75-25% training approach.

C. KNN
The second algorithm chosen for classification is the kNN algorithm, and the same method is applied for training and testing. The accuracy results obtained are tabulated in Table 2.

For this algorithm, we got the best result, about 58% accuracy, with the 80-20% training approach.

D. SVM
The last algorithm chosen from the survey is SVM, and the same methodology is followed. The results obtained from this algorithm are tabulated in Table 2.

For this algorithm, we got the best result, about 55% accuracy, with the 70-30% training approach.

Table 2: Comparison table

Algorithm | 80-20% | 75-25% | 70-30% | 60-40%
NB Classifier | 93.33% | 91.66% | 75.11% | 72.22%
KNN | 58% | 53% | 50% | 31%
SVM | 50% | 53% | 55% | 33.3%

From comparison Table 2, we can say that the best ML algorithm for our dataset is the Naive Bayes classifier, because it gives an accuracy of about 93.33%.

Fig 8 : Accuracy calculations of various algorithms

From Figure 8, it is observed that for all the training/testing percentages of the datasets, NB gives the best performance when compared with the other algorithms.
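The WCSS curve used for the elbow method in Section 5.A can be sketched as follows. This is a minimal sketch under stated assumptions, not the code that produced Figure 6: the points are synthetic (visit count, duration) pairs, the k-means routine is a plain Lloyd's iteration with random initial centroids, and `elbow_k` is one simple heuristic for locating the bend, which in practice is often read off the plot by eye.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def kmeans_wcss(points, k, seed=0, max_iter=100):
    """Run Lloyd's k-means once and return the final WCSS."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[idx].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

def wcss_curve(points, k_max, restarts=5):
    """Best WCSS over several random restarts for each k in 1..k_max."""
    return {k: min(kmeans_wcss(points, k, seed=s) for s in range(restarts))
            for k in range(1, k_max + 1)}

def elbow_k(curve):
    """Heuristic (an assumption): the k where the curve bends most sharply."""
    ks = sorted(curve)
    bend = {k: (curve[k - 1] - curve[k]) - (curve[k] - curve[k + 1])
            for k in ks[1:-1]}
    return max(bend, key=bend.get)

# Synthetic, hypothetical (visit_count, duration_minutes) points in 3 groups.
rng = random.Random(7)
points = [(cx + rng.uniform(-2, 2), cy + rng.uniform(-2, 2))
          for cx, cy in [(2, 5), (20, 30), (40, 10)] for _ in range(15)]

curve = wcss_curve(points, k_max=6)
for k in sorted(curve):
    print(k, round(curve[k], 2))
print("elbow heuristic suggests k =", elbow_k(curve))
```

Random restarts are used because a single k-means run can stop in a poor local optimum, which would distort the curve. The heuristic is only a convenience; the value k = 5 reported above comes from the authors' own graph, not from this sketch.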


6. CONCLUSIONS

In this work, the user browsed data is obtained from a browsing history viewing tool. We model the browsed data in reference to social networks, entertainment platforms, and others. Through real data analysis, we observe where a user is most interested. We developed several ML algorithms to classify the data. The proposed models are used to analyze and predict user behavior from the dataset, and the accuracy of each developed algorithm is then calculated. The algorithms chosen here are KNN, the NB classifier and SVM, and k-means is used for clustering the data.

Among these developed algorithms, we got a good result for the NB classifier, which gave an accuracy of about 93%. From this we can conclude that the Naive Bayes classifier is the best among the algorithms developed, and we got satisfactory results for the work carried out.

In future, we will try to predict and analyze by considering other parameters, and we will try to compare with other algorithms too. Other algorithms may give better accuracy than the NB classifier.

REFERENCES

[1] M. Callara and P. Wira, "User Behavior Analysis with Machine Learning Techniques in Cloud Computing Architectures," in Proceedings of the International Conference on Applied Smart Systems (ICASS), Medea, Algeria, 2018, pp. 1-6.

[2] H. Yan, C. Yang, D. Yu, Y. Li, D. Jin and D. Chiu, "Multi-site User Behavior Modeling and Its Application in Video Recommendation," in IEEE Transactions on Knowledge and Data Engineering.

[3] Ladekar, Ashwini, Pooja Pawar, Dhanashree Raikar and Jayashree Chaudhari. "Web Log based Analysis of User's Browsing Behavior." (2017).

[4] R., Virendra & V., Govind, "Prediction of User Behavior using Web log in Web Usage Mining," International Journal of Computer Applications. 139. 4-7. 10.5120/ijca2016909228.

[5] V. Anitha and P. Isakki, "A survey on predicting user behavior based on web server log files in a web usage mining," in Proceedings of the International Conference on Computing Technologies and Intelligent Data Engineering (ICCTIDE'16), Kovilpatti, 2016, pp. 1-4.

[6] Sisodia, D. S., Khandal, V., & Singhal, R. (2018). Fast prediction of web user browsing behaviours using most interesting patterns. Journal of Information Science, 44(1), 74-90.

[7] Liu, Qingchao & Lu, Jian & Chen, Shuyan & Zhao, Kangjia. (2014). Multiple Naïve Bayes Classifiers Ensemble for Traffic Incident Detection. Mathematical Problems in Engineering. 2014. 1-16. 10.1155/2014/383671.

[8] N. K. Gyamfi and J. Abdulai, "Bank Fraud Detection Using Support Vector Machine," 2018 IEEE 9th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, 2018, pp. 37-41, doi: 10.1109/IEMCON.2018.8614994.

[9] https://en.wikipedia.org/wiki/Web_browsing_history.

[10] https://hetmanrecovery.com/web-browser-history-viewer-software.htm.
