Dissertation on
“Impact of User’s Relationship on User
Interest”
Submitted in partial fulfilment of the requirements for the award of degree
of
Bachelor of Technology in
Computer Science & Engineering
Submitted by:
Varnika Srivastava Vedanti Under the guidanc
Vinod Ladda Vibush Karthee
Shanmugam Internal Guide Pro
01FB15ECS336 Bhaskarjyoti Das
01FB15ECS345 Visiting Professor P
01FB15ECS348 University
January – May 2019
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
FACULTY OF ENGINEERING PES UNIVERSITY (Established under
Karnataka Act No. 16 of 2013) 100ft Ring Road, Bengaluru – 560 085, Karnataka, India
PES UNIVERSITY
(Established under Karnataka Act No. 16 of 2013) 100ft
Ring Road, Bengaluru – 560 085, Karnataka, India
FACULTY OF ENGINEERING
CERTIFICATE
This is to certify that the dissertation
entitled
‘Impact of User’s Relationship on User
Interest’
is a bonafide work carried out by
Varnika Srivastava Vedanti 01FB15ECS336
Vinod Ladda Vibush Karthee 01FB15ECS345
Shanmugam 01FB15ECS348
In partial fulfilment for the completion of eighth semester project work in the Program of Study
Bachelor of Technology in Computer Science and Engineering under rules and regulations of PES
University, Bengaluru during the period Jan. 2019 – May. 2019. It is certified that all corrections /
suggestions indicated for internal assessment have been incorporated in the report. The
dissertation
th
has been approved as it satisfies the 8 semester academic requirements in respect of project
work.
Signature Prof. Faculty
Bhaskarjyoti Das
Associate Professor
Signature Dr.
External Viva
Shylaja S S
Chairperson
Signature Dr. B K Name of the Examiners
Keshavan Dean of
Faculty
Signature Dr. B K 1. __________________________
Keshavan Dean of
2. __________________________ __________________________
Signature with Date
__________________________
DECLARATION
We hereby declare that the project entitled “Impact of User’s
Relationship on
User Interest” has been carried out by us under the guidance of
Professor
Bhaskarjyoti Das, Adjunct Faculty and submitted in partial fulfillment of
the
course requirements for the award of degree of Bachelor of Technology
in
Computer Science and Engineering of PES University, Bengaluru
during the
academic semester January – May 2019. The matter embodied in this
report has
not been submitted to any other university or institution for the award of any
degree.
01FB15ECS336 Varnika Srivastava <Signature>
01FB15ECS345 Vedanti Ladda <Signature>
01FB15ECS348 Vibush Shanmugam <Signature>
ACKNOWLEDGEMENT
We would like to thank our guide Prof. Bhaskarjyoti Das for his guidance, support
and
patience throughout our final year project, without which our project would be
incomplete.
We appreciate the efforts made by the project co-ordinators at PES University –
Prof.
Preet Kanwal and Prof. Sangeetha V, for efficiently managing the project logistics
for all
students of Computer Science
Department
We are extremely grateful to our Chairperson, Computer Science and
Engineering,
Dr.Shylaja S Sharath, for her valuable support throughout her time as the
Chairperson.
We would like to thank Dr. B K Keshavan, for managing his responsibilities as a
Dean of
Faculty extremely well.
Our Vice Chancellor, Dr. K N Balasubramanya Murthy, must be credited for not
only
managing the day-to-day operations of the institution, but also constantly
encouraging
students to do better.
We are grateful to our CEO and pro chancellor, Dr. Jawahar Doreswamy, for
envisioning
the idea of giving students an opportunity to gain practical exposure through
internships
in addition to traditional learning, and giving them an all-round experience while
studying at this esteemed
institution
Our acknowledgement would be incomplete without thanking our Chairman and
Chancellor, Dr. M R Doreswamy, for founding this prestigious
institution.
Last, but not the least, we thank our parents for their gratuitous
support.
ABSTRACT
A user’s direct or hidden relationship to another user can have impact on his interests. A
direct
relationship would mean a network of friends where users exercise their own choice,
and
hidden relationship would mean a network of similar users based on certain user
attributes.
In this light, we study various social networks that can impact user interest. In an explicit
network of friends, the links are likely to get stale with time as users may not interact
frequently
with all their friends. Hence, an implicit network based on user similarity, should provide
more
relevant information for predicting user
interest.
In case of the Yelp dataset, the interest is the business. The task of finding which
business
interests a user can be modelled as a link prediction problem between user and
business.
User relationship from social networks, can be captured using node embedding and this
information can be used to perform link prediction. Explicit networks and implicit
networks
were compared for their accuracy at this link prediction
task.
It was observed that implicit networks outperformed the explicit friends network, for
similar
settings, for predicting links between user and
business.
TABLE OF CONTENTS
Introduction ...............................................................................................................1
1.1 Overview .......................................................................................................................................1
1.2 Scope ............................................................................................................................................2
1.3 Objective .......................................................................................................................................4
1.4 Outcomes .....................................................................................................................................5
Research Background ..............................................................................................6
2.1 Literature Survey ........................................................................................................................6 2.1.1
Analyzing Implicit Social Networks in Multiplayer Online Games [7] ..........................................6
2.1.2 Suggesting (More) Friends Using the Implicit Social Graph [2] ....................................................7
2.1.3 Identifying Implicit Relationships between social media users to support social commerce [8] ........8
Implementation ...................................................................................................................................9
Results ............................................................................................................................................ 10
2.1.4 Measuring Tie Strength in Implicit Social Networks [9] ............................................................ 10
2.1.5 Learning to Recommend with Explicit and Implicit Social Relations [3] .................................... 12
2.1.6 Search Lens [10] .................................................................................................................... 13
2.1.7 Graph Embeddings for Social Network Analysis: State of the Art [11] ....................................... 14
2.1.8 Representation Learning and Graph Classification [12] ............................................................. 14
2.1.9 DeepInf: Modeling Influence Locality in Large Social Networks [13] ........................................ 15
2.1.10 Attributed Social Network Embedding [14] ........................................................................... 15
2.1.11 Network Representation Learning with Rich Text Information[15] .......................................... 15
2.1.12 Heterogeneous Network Embedding via Deep Architectures [16] ........................................... 16
2.1.13 Node2Vec: Scalable Feature Learning for Networks [4].......................................................... 16
2.1.14 Link Prediction in Heterogeneous Social Networks [19] ......................................................... 18
2.1.15 A Link Prediction Approach to Recommendations in Large-Scale User-Generated Content Systems
[20] ................................................................................................................................................. 18
Methodology............................................................................................................ 20
3.1 Proposed Approach ................................................................................................................. 20
3.2 Improvement over Alternative Approaches .......................................................................... 20
3.3 High Level System Architecture ............................................................................................. 22
Environment Requirements ................................................................................... 25
4.1 Hardware Requirements .......................................................................................................... 25
4.2 Software Requirements ........................................................................................................... 25
4.3 Data Requirements .................................................................................................................. 27
Demonstration of Outcome .................................................................................... 31
Proposed Approach ................................................................................................ 33
6.1 Creating User Explicit network of friends ............................................................................. 33
6.2 Creating User Business bipartite network ............................................................................ 33
6.3 Creating Implicit Network Type 1 ........................................................................................... 33
6.4 Creating Implicit Network Type 2 ........................................................................................... 34
6.5 Node Embedding ..................................................................................................................... 34
6.6 The Link Prediction Problem ................................................................................................... 35
6.7 Demo Approach ........................................................................................................................ 36
Detailed Design ....................................................................................................... 39
7.1 Creating the networks ............................................................................................................. 39
7.2 Performing Node2Vec ............................................................................................................. 42
7.3 Link prediction ......................................................................................................................... 44
Results ..................................................................................................................... 47
8.1 Visualizations of networks...................................................................................................... 47 8.1.1
User explicit network of friends .............................................................................................. 47
8.1.2 User implicit network type 1 ................................................................................................... 49
8.1.3 User implicit network type 2 ................................................................................................... 50 8.1.4
Comparison of the three networks ........................................................................................... 51
8.2 Link prediction results ............................................................................................................ 52
Conclusions ............................................................................................................ 55
Future Work ............................................................................................................. 56
References .............................................................................................................. 57
Appendices ............................................................................................................. 59
12.1 Implicit Network Type 2 after randomly sampling the edges to reduce the size ............. 59
12.2 Visualization code ................................................................................................................. 60
12.3 Network creation code .......................................................................................................... 62
12.4 Node embedding code .......................................................................................................... 66
12.5 Link prediction code ............................................................................................................. 66
List of Tables
Table No. Title Page
Number
2.1.4.1 Axioms satisfied by various similarity measures 11
2.1.13.1 Area Under Curve Scores for Link Prediction using the
Hadamard operator
18
6.5.1 Node2Vec Parameters for the four networks 34
8.1.4.1 Comparison Table for the user explicit and user implicit
parameters
52
networks in terms of network
8.2.1 Model Accuracy and Recall, Precision F1-Score for class 1 53
8.2.2 Model Accuracy and Recall, Precision F1-Score for class 0 53
List of Figures No.
Title Page
Number
Figure
2.1.4.1 Measuring Kendall-Tau Rank correlations on rankings produced
by different pairs of tie-strength
measures
2.1.13.1 Node2Vec Algorithm
3.2.1 Module Dependencies
3.2.2 High level architecture diagram
3.4.1 A sample user JSON
3.4.2 A sample business JSON
3.4.3 A sample review JSON
5.1 Use case diagram of the user interface
6.7.1 A sample view of the UI for recommendations page
7.1.1 A snapshot of the user feature vectors file
7.1.2 High Level Flowchart for network creation
7.2.1 High Level Flowchart for Network Embedding using Node2Vec
7.3.1 High Level Flowchart for Link Prediction
8.1.1.1 Degree histogram of the user explicit network
8.1.1.2 Logarithmic plot for degree and count for explicit network
8.1.1.3 Visual for the entire user explicit network
8.1.2.1 Degree histogram of the user implicit network type 1
8.1.2.2 Logarithmic plot for degree and count for implicit network type 1
8.1.2.3 Visual for the entire user implicit network type 1
8.1.3.1 Degree histogram of the user implicit network type 2
8.1.3.2 Logarithmic plot for degree and count for implicit network type 2
8.1.3.3 Visual for the entire user implicit network type 2
12.1.1 Degree histogram of the subgraph of user implicit network type 2
after random sampling of edges from
original
12.1.2 Logarithmic plot for degree and count for subgraph of implicit
network type 2
12.1.3 Visual for the subgraph of user implicit network type 2
12.2.1 Reading graph from an edgelist file
12.2.2 Computing number of nodes and edges
12.2.3 Computing network density
12.2.4 Degree distribution histogram plot
12.2.5 Logarithmic plot of degree and count
12.2.6 Computing Global Clustering Co-efficient
12.3.1 Creating the user explicit network of friends
12.3.2 Creating the user business review network
12.3.3 Creating the implicit network type 1
12.3.4 Creating the implicit network type 2
12.4.1 Performing Node Embedding using Node2vec
12.5.1 Creating the dataset by adding embedding vector
12.5.2 Training SVM Model
12.5.3 Testing the SVM model and computing accuracy
12.5.4 Computing Recall, Precision, F1-score for each class k
12.6.1 UI Snapshot 1
12.6.2 UI Snapshot 2
Definitions, Acronyms and
Abbreviations
1. Implicit networks – networks that are created by defining relationships based on
some
inherent property of the system
2. Explicit networks – networks that are created by defining relationships that are
explicitly created in the system.
3. Bipartite networks – the network can be divided into two sets of nodes where an
edge
exists only between nodes that belong to different
sets
4. Edges/links – connection between two nodes indicating some relationship
between the
two nodes
5. Graph/Network embedding - Network embedding is an important method to learn
low-
dimensional representations of vertices in networks, aiming to capture and
preserve the
network structure. Note that a network is represented as a graph (hence graph
embedding) and hence the terms network embedding and graph embedding are
used
interchangeably
6. Node2Vec - an algorithmic framework for learning continuous feature
representations
for nodes in networks. In node2vec, nodes are mapped to a low-dimensional
space of
features that maximizes the likelihood of preserving network neighborhoods of
nodes.
7. LINE - Large-scale Information Network Embedding A network embedding
algorithm
suitable for arbitrary types of information networks: undirected, directed, and/or
weighted and preserves both the local and global network
structures.
8. DeepWalk – A network embedding algorithm that uses deep learning to learn
node
representations. And is considered to be a special case of the Node2Vec
algorithm
9. API – Application Programming Interface
10. Links or edges – a connection between two nodes
(entities)
11. Accuracy – fraction of total number of instances, that the model predicted
correctly
12. Precision – fraction of positive instances(predicted by the model) that the model
predicted correctly
13. Recall – fraction of actual positive instances that the model predicted
correctly.
14. F1-score – harmonic mean of precision and
recall
15. JSON – JavaScript Object Notation
16. DeepInf-GAT - Graph Attention Network
17. DeepInf-GCN - Graph Convolutional
Network
18. TADW - Text Associated Deep Walk
19. Jaccard similarity – measures the number of common neighbours normalized by
the
total number of neighbors between a node
pair
20. Cosine Similarity – measure the cosine angle between two non-zero
vectors
21. Hadamard product – a matrix operation wherein two matrices A and B with the
same
dimensions are multiplied element-wise to create a new matrix C
where
Ci = Ai ∗ Bi
22. Common neighbor – number of directly connected vertices that two nodes have
in
common
23. Network density – fraction of the number of the maximum possible edges in a
graph
that actually exist in the graph
24. Global clustering co-efficient – an indicator of the tendency of nodes to form
clusters,
on a global scale
25. Node degree – the number of connections to the other nodes (in an undirected
graph)
Impact Of User’s Relationship on User’s Interest
CHAPTER 1
Introduction
1.1 Overview
In the present-day context, a large amount of user-based content is being
generated,
posing multiple challenges such as personalizing user content, recommending
relevant
products, etc. While there are multiple ways to solve these issues, one possible
way is
to use the available social networks to extract user relationships and predict user
interests based on these
relationships.
In this light, we have explored various types of social networks such as explicit
networks, implicit networks and bipartite networks, in terms of extracting user
relationships from these networks via node embedding and then using this
information
to predict user interest.
We consider product based social networks where users are allowed to review
products.
Yelp and Amazon are two examples of product based networks, each willing to
recommend an appropriate product to its users. Since Yelp also allows users to
explicitly form friends, review businesses such as restaurants, and provides a
convenient API to download the required data, it was ideal for creating multiple
types
of networks and studying their impact on user
interest.
Many recommendation systems rely on modeling user behaviour by providing
collaborative filtering [1] from user-item rating matrix alone and some social
recommenders take into consideration the user’s explicit network of
friends.
The problem with using only ratings is that specific attributes of a business, which
are
liked or disliked by a user, are not captured adequately. For example, a user
rated a
restaurant business as 2 because he dislikes the cuisine, and this rating might
deter
another user who loves the cuisine, from visiting the
business.
Department of CSE Jan-May 2019 1 | Page
Impact Of User’s Relationship on User’s Interest
In case of considering only an explicit network, the links of friendships get stale
over
time as users, especially those having a large group of friends, will not interact
equally
with all the friends. Moreover, some users choose not to have friends and simply
use
the web to review products. For such users, an explicit network of friends cannot
be
created.
What we propose is that modeling user behaviour using an implicit network
based on
user similarity to other users, carries more relevant information which may be
used to
improve recommendations. Multiple studies [2] [3] show that implicit networks, or
a
combination of implicit and explicit networks, are better at modeling user
interests.
Recommending an item such as a restaurant, music, etc, to a user can be
modeled as
predicting a link between that item and user. Note that this link does not exist at
present,
but is likely to form in the future. The link prediction task is essentially a machine
learning prediction problem, and has been modelled as binary classification in
our
research.
In order to perform the link prediction, any network information has to be
converted to
a suitable representation in order for the machine learning models to function
well. To
achieve this, we use the network embedding technique to convert the network
(represented as a graph) into a vector format. This vector can be used as input to
the
machine learning models for link
prediction.
1.2 Scope
The project was split into four high level
modules:
1. Network Creation
2. Visualizations of networks (network analysis)
3. Network embedding
4. Link prediction
Three types of networks were
created:
• User explicit network of friends
• User - Business bipartite network
Department of CSE Jan-May 2019 2 | Page
Impact Of User’s Relationship on User’s Interest
• User implicit network
Two types of implicit networks created
were:
• Based on users that visit the exact same businesses (referred to as implicit
network Type 1)
• Based on users that visit similar businesses (referred to as implicit network
Type
2)
• User similarity thresholding model was used to create an edge between a
given
pair of users
Multiple models were constructed using the following six combinations of the
above
networks:
1. User explicit network (baseline model)
2. Implicit Type 1 network
3. Implicit Type 2 network
4. Combination of user explicit network and user-business bipartite
network
5. Combination of implicit Type 1 network and user-business bipartite
network
6. Combination of implicit Type 2 network and user-business bipartite
network
For graph embedding, we will be using state of the art tool - Node2Vec [4]
among the
multiple other graph embedding techniques such as LINE [5] and DeepWalk
[6].
In order to demonstrate the differences in the models built, a simplistic UI
that
captures a user’s interest in a set of businesses and friends was
created.
Limitations of the Project
• We will not be focusing on extensive UI for the demo. Minimal functionality
will be presented, sufficient to compare the output of the six
models.
• We don’t aim to build a social recommendation system, but rather focus on
the
network embedding and link prediction
problems.
• While considering businesses, only restaurants were used and user
similarity
was based on cuisine preference extracted from the restaurants they
visited
Department of CSE Jan-May 2019 3 | Page
Impact Of User’s Relationship on User’s Interest
• For link prediction, machine learning models were limited to Support Vector
Machines
• In case of network embedding, owing to memory constraints (16GB RAM),
the
edgelist file sizes were limited to 6MB to avoid out-of-memory errors and
system crash.
• Running memory-intensive code on cloud instances were considered, but it
proved to be time-consuming to set up and expensive in the long-run and
hence
we were forced to reduce dataset size for them to run on normal
laptops.
1.3 Objective
Our project investigates the following:
1. How well implicit networks can capture user behavior compared to user
explicit
networks.
2. If a combination of the user business bipartite network with the other
networks
improves link prediction performance (predicting user
interest)
Having prepared the required networks, graph embedding was performed using
Node2Vec
The Yelp dataset was used to perform the research. The Yelp dataset contains
users,
explicit friends and reviewing businesses (businesses such as restaurants,
barber shops,
automotive shops, etc). The Yelp platform allows a user to review such a
business, and
connect with other users.
Performance of the six models mentioned in 1.2, at the link prediction task can
be
measured using:
1. Test accuracy
2. Precision
3. Recall
4. F1-Score
Department of CSE Jan-May 2019 4 | Page
Impact Of User’s Relationship on User’s Interest
1.4 Outcomes
Network analysis will be done for user explicit network and the two kinds of user
implicit networks
The results of six models, will be presented in a tabular
column.
We expect the implicit network based models to exceed the baseline model in
performance.
To make it more appealing to an end-user, we propose a simplistic UI that allows
a user
to select friends, click on multiple businesses, say to look at them, and on
registering
this information, we aim to recommend a suitable business by using the six types
of
models.
Note that the UI and functionality is not the main goal of the project, we want to
focus
our efforts and devote most of the time on building the newly proposed
models.
Department of CSE Jan-May 2019 5 | Page
Impact Of User’s Relationship on User’s Interest
CHAPTER 2
Research
Background
Given that our project has components of network creation, node embedding and link
prediction, we review literature that expands our knowledge in these
areas.
We also study literature that is a possible use case of our project using an alternate
approach,
and contrast their approach with
ours.
We will also be looking at papers that come up with an appropriate network embedding
from
the user’s network (explicit and implicit) and use it for the link prediction
task.
2.1 Literature Survey
The following papers present an experiment with implicit networks. They help us
understand
how implicit networks can be created, and how they are useful in solving the
problem.
2.1.1 Analyzing Implicit Social Networks in Multiplayer Online Games
[7]
Introduction
Online multi-player gaming websites need the players to select multiple other
players
to form teams. Selecting the right kind of teammates is crucial for a good
game
experience. For this use case, the paper has explored ways of creating an
implicit
network based
Implementation
For the implicit gaming network, six types of player interactions have been
modelled:
1. Players are present in the same
match
2. Players are present on the same side of a
match.
3. Players are opponents
4. Players who won matches together
5. Players who lost matches together
Department of CSE Jan-May 2019 6 | Page
Impact Of User’s Relationship on User’s Interest
6. Percentages of games overlap
Point 6 can be further divided into percentage matches lost together, percentage
matches won together, etc
The paper proposes an algorithm that takes as input a list of all the players who
have
logged into the game. Using the implicit network, the algorithm then computes
the
group membership for each
player.
Players from the same group are assigned to new matches. If a group is too
large as per
the rules of the game, the group is split into
two.
Conclusion
Using implicit networks for grouping online players for matches provides a better
gaming experience than randomly assigning players to teams while scheduling a
match.
2.1.2 Suggesting (More) Friends Using the Implicit Social Graph
[2]
Introduction
This paper uses the fact that people implicitly form groups just by interacting with
different people, and based on this interaction an implicit network is generated. In
an
implicit group, people do not explicitly specify who should be part of this group.
This
is different from explicit networks where users can specify their group of
friends.
This paper suggests a new algorithm to suggest friends by first generating an
implicit
network based on interactions captured via email (Gmail to be specific). This
interaction based implicit network is expected to capture user affinity towards
other
users, and hence provide a ranking of friends to be suggested (some friends are
ranked
higher meaning that user is more likely to make friends with
them)
Implementation
Creation of implicit network for the Friend Suggest
algorithm
Department of CSE Jan-May 2019 7 | Page
Impact Of User’s Relationship on User’s Interest
The directed network is created by building a user’s egocentric network,
connecting the
user to other users who are either senders of an email to the user, or recipients
of the
email sent by the user.
The edges of the implicit network also have edge weights which are based on the
following three attributes of the interaction:
1. Frequency – Higher the frequency of interaction, higher the priority of the
group
2. Recency – The more recent the interaction, the higher priority it
gets
3. Direction – The interactions initiated by the user is given more priority over
interactions not initiated by him
The paper proposes a metric called the Interaction Rank whose value is decided
by
totaling the number of emails sent or received by a user, to or from an implicit
group.
It also considers this interaction’s recency to set priority for more recent
interactions.
The Friend Suggest algorithm finally uses this weighted, directed implicit network
and
predicts new friends given a set of existing
friends.
The paper suggests an application where the user can select the recipients of the
mail,
and this system can suggest more recipients which haven’t already been
added.
Conclusion
The implicit network was based on certain attributes of user interaction such as
interaction-frequency, initiator of the interaction (resulting in a directed graph)
and the
time of the interaction (how recent was the latest interaction). The Friend
Suggest
algorithm was based on this network wherein given a few initial friends, more
new
possible friends were suggested. While this research was implemented using
Gmail, it
could be extended to any interaction based
network.
2.1.3 Identifying Implicit Relationships between social media users to
support social commerce
[8]
Introduction This paper claims that implicit network is better than explicit
network since users with
similar interests need not interact with each other
directly.
Department of CSE Jan-May 2019 8 | Page
Impact Of User’s Relationship on User’s Interest
In order to form implicit networks, temporal analysis techniques are used, and
this
exceeds in performance compared to models built with explicit
networks.
The similarity to our work is that, for temporal analysis, user features vectors are
formed, representing how active a user is. We create user feature vector based
on
attributes of the business visited by the
user.
However, we do not use temporal analysis but rather use jaccard and cosine
similarity
to compute user similarity. Unlike this research, we also make use of node
embedding
to perform link prediction.
Implementation
Networks were created from
Digg.com
To form an implicit network, each user has a user feature
vector.
User feature vector for a particular user u are created such that an element in the
user
feature vector vi is computed as follows:
N (u) U
vi = i ∑ j Ni(j) where Ni(u) is the number of posts made by user u
on a particular day i.
U is the total number of posts made by all users U on the particular
∑ j Ni(j)
day i
Using the user feature vectors, temporal similarity is calculated between any
given user
pair and a user similarity matrix is formed. The elements of the matrix represent
the
strength of the similarity between two user
pairs.
A thresholding mechanism is used to create edges between users i.e; an edge is
formed
between two users if the similarity exceeds the threshold value. In this manner,
an
implicit network is created.
A Top-K filtering method is also used to create the implicit network. Here, edges
to
other nodes are ranked such that more similar pairs have a higher rank. K is the
percentage of users that must be retrieved from the ranking
list.
Department of CSE Jan-May 2019 9 | Page
Impact Of User’s Relationship on User’s Interest
To create an explicit network which is the baseline model, an edge was made
between
two users who commented on the same story and in another type, edge was
created
between the author of a story and the commentor on the same
story.
Results
The quality of this implicit network is judged using Precision and Recall, and a
table is
presented for different values of thresholding. It also uses the alternative Top-K
filtering
method to create the implicit network.
Precision and Recall is also stated for the two types of explicit networks
created.
This output is compared and the implicit network is seen to have outperformed
the
explicit network.
2.1.4 Measuring Tie Strength in Implicit Social Networks
[9]
Introduction
Calculating tie strength is useful for applications that require ranking edges
between
two nodes. This paper presents a set of axioms that a measure for tie strength
can satisfy.
It compares the various tie strengths such as Common neighbours, Jaccard
similarity,
Adamic-Adar, etc. for the axioms they
satisfy.
The Kendall-Tau Rank Correlation between various tie-strength measures is
presented.
Implementation
The paper mentions the following
axioms:
1. Axiom 1 – Isomorphism : Tie strength between any two given nodes
depends
on the link structure and not on the labels assigned to the
nodes
2. Axiom 2 – In an empty graph, tie strength between any two given nodes in
that
graph will be 0. In a graph where only two nodes are present, with a single
edge
between them, then tie strength will be
1
3. Axiom 3 – Node pairs with more features in common will have higher tie
strength
Department of CSE Jan-May 2019 10 | Page
Impact Of User’s Relationship on User’s Interest
4. Axiom 4 – Intimacy : Node pairs having a smaller total list of features, are
likely
to have higher tie strength
5. Axiom 5 – Popularity : Given a large set of features for a node, more
number of
nodes will be similar to that node
6. Axiom 6 – Conditional Independence of vertices : The tie strength of a
particular
node to other vertices does not depend on features that the node does not
have;
it only depends on features that the node
has
7. Axiom 7 – Conditional Independence of Events : Increase in the strength of
a
tie between two nodes due to a feature, is independent of the effect of
other
features
8. Axiom 8 – Sub-modularity: The marginal increase in tie strength between a
pair
of nodes due to a particular feature Q is at most the tie strength between
the
same node pair if Q was the only feature they
had.
Conclusion
Jaccard similarity satisifies axioms 1-
5.
Table 2.1.4.1 - Axioms satisfied by various similarity measures
The paper presents the following observations about the Kendall-Tau Rank
Correlation:
Common Neighbor and Max show very low values of correlation (-0.2 <= τ <=
0.2)
across the datasets
Department of CSE Jan-May 2019 11 | Page
Impact Of User’s Relationship on User’s Interest
Adamic-Adar, Delta, and Linear measure of tie strength show a strong positive
correlation (τ > 0.6) irrespective of the
dataset.
Jaccard similarity is not mentioned as it does not satisfy all the
axioms.
Figure 2.1.4.1 Measuring Kendall-Tau Rank correlations on rankings produced by different pairs of tie-strength
measures on different datasets (a) Adamic-Adar(AA) Linear(L) and Delta(D) (b) Common Neighbour (CN) and Max
(c) Adamic-Adar(AA) Linear(L) and Common Neighbour (CN) (d) Adamic Adar, Max, Delta and Linear
2.1.5 Learning to Recommend with Explicit and Implicit Social
Relations
[3]
Introduction
In the real world, users usually look up to their trusted friends for
recommendations.
This fact is not captured by traditional recommender systems as they assume the
independence and identical distribution of the
users.
This paper proposes a “factor analysis framework”, combining user preference
and his
friends’ preference. This framework is generic and scalable and can be used in
case
explicit trust network of a user is not
available.
Implementation
Department of CSE Jan-May 2019 12 | Page
Impact Of User’s Relationship on User’s Interest
This method takes care of two scenarios – one where user item matrix and an
explicit
friends network is present, and another where only the user item matrix is given.
In the
second case, an implicit network is created using vector space similarity and
pearson
correlation co-efficient are used on the user item matrix. A maximum ‘K’ number
of
users will be similar to a given user.
Conclusion
A large value of K will introduce some noise in the trust network and a small
value of
K will not suffice to effectively represent trust network. Commonly, sparse data is
an
issue, and by also implementing diffusion of information in a network, the data
sparsity
issue can be overcome.
2.1.6 Search Lens [10]
Introduction
SearchLens is a novel tool, where users can create a collection of “Lenses” that
captures
their interests. The aim is to enable users to use various combinations of lenses
in order
to search for more relevant items in various diverse contexts such as suggesting
places
to visit for a traveler in a new city, suggesting appropriate
products.
A key feature of SearchLens is transparency owing to its ability to generate user
personalized views with visual
explanations.
Implementation
While this research has the potential to be applied to numerous contexts, it has
been
limited to restaurant reviews.
A Lens is defined as a reusable set of weighted keywords that efficiently capture
user
interest, and can be used in different configurations to suit users’
requirements.
The Lens model was implemented on a subset of the Yelp Dataset of 48,000
restaurants
and 2 million reviews.
Upon the creation or enabling of a Lens, a visual representation is shown to the
user,
displaying the quality of match between the Lens criteria and the content. For
example,
displaying the frequency of occurrence of all the keywords in the Lens present in
the
given reviews.
Department of CSE Jan-May 2019 13 | Page
Impact Of User’s Relationship on User’s Interest
Conclusion
SearchLens uses the concept of Lenses to capture user interest. Providing a
visualization saves the user the effort of explicitly specifying keywords to
search.
This is a keyword based extraction model and no creation of networks takes
place. Since
keywords are used, Word2Vec is used to convert them to a meaningful vector. A
model
is trained on this vector, producing similar vectors for semantically similar
words.
2.1.7 Graph Embeddings for Social Network Analysis: State of the Art
[11]
Defines what graph embedding is, and lists how certain types of graph
embedding
methods can be used in machine learning algorithms, such as node
classification.
Explains how neural networks can improve node embedding, and how we can
use tools
created for word embedding for graph embedding: Word2Vec, Continuous Bag of
Words, and Skip Gram.
This work also details major graph embedding algorithms, such as DeepWalk,
LINE,
node2vec, and compares experimental results for all the three algorithms on
three
different datasets.
Finally, the text discusses techniques for bipartite graph embedding, community
structure preservation and dimensionality reduction in
brief.
2.1.8 Representation Learning and Graph Classification
[12]
This paper claims that successful embedding methods for the heterogeneous
should
jointly consider the multiple content types and structural properties of the graphs.
However, different vertex types have different semantics, hence are
incomparable in
general. Therefore, using methods such as DeepWalk or LINE that does not
discriminate among different relation types result in a poor
performance.
This paper also provides a description of Graph Representation Learning for
Heterogeneous network and proposes an approach to tackle this by separating
the
networks of different types and jointly learn a representation for
them.
Department of CSE Jan-May 2019 14 | Page
Impact Of User’s Relationship on User’s Interest
The following papers cover different aspects of graph embedding and different
algorithms with
more detail, as well as experimental
results:
2.1.9 DeepInf: Modeling Influence Locality in Large Social Networks
[13]
Focuses on using Deep Learning for Graph Representation Learning to predict
social
influence i.e.; predict user’s behaviour based on that of his neighbours, taking
into
consideration the vertex level
features.
Two variants are proposed - DeepInf-GCN(Graph Convolutional Network) and
DeepInf-GAT(Graph Attention) ; GCN performs worse than other approaches
(due to
homophily assumption)
Limitation of this approach is that the ego-network features are not used in these
two
models and hence important information might be
missed
2.1.10 Attributed Social Network Embedding [14]
The proposed model in this paper considers network attributes as well as
structure.
The Social Network Embedding framework proposed can learn more informative
representations, achieving substantial gains on the tasks of link prediction and
node
classification. Specifically, SNE significantly outperforms
node2vec.
The final SNE model integrates the models of structures and
attributes.
2.1.11 Network Representation Learning with Rich Text
Information[15]
Focusses on text-associated DeepWalk, a network representation learning
method
based on the state-of-the-art DeepWalk. The paper proves that DeepWalk
algorithm
actually factorizes a matrix M and figure out the closed form of M. Introduces text
features into Netwrok Representation Learning and gets a 5% to 20% advantage
as
compared to other baselines especially when the training ratio is small.
Experimental results on three datasets with different training ratios show the
Department of CSE Jan-May 2019 15 | Page
Impact Of User’s Relationship on User’s Interest
effectiveness and robustness of TADW as compared to other baselines.
2.1.12 Heterogeneous Network Embedding via Deep Architectures
[16]
This paper describes a deep embedding algorithm for networked data, using a
highly
nonlinear multilayered embedding
function.
The Heterogeneous Network Embedding Framework is presented
mathematically; the
embedding process encodes both heterogeneous content and linkage
information to a
multidimensional representation for each
object.
Results for classification, multi modal search and clustering show that the HNE
representation gives the most accurate results, when compared to only content,
links,
etc. being represented in the graph embedding
structure.
2.1.13 Node2Vec: Scalable Feature Learning for Networks
[4]
Introduction
This paper provides us with the Node2Vec algorithm, where nodes are mapped
to
feature vectors while maximizing the probability of keeping the network structure
intact. It has various parameters to control the random walk and therefore works
on
diverse network structures. This flexibility helps in learning more meaningful
vectors.
While proposing a novel method for graph representation learning, node2vec is
compared to existing state-of-the-art techniques on the tasks of multi-class
classification and link prediction.
Implementation
The Node2Vec algorithm is as follows:
Department of CSE Jan-May 2019 16 | Page
Impact Of User’s Relationship on User’s Interest
Figure 2.1.13.1 – The Node2Vec Algorithm
We see parameters like walk length, dimensions of the output vector for the node
or
edge embedding, return parameter that controls the likelihood of re-visiting a
node, and
the in-out parameter, which controls the depth and breadth of the
search.
A high value for the in-out parameter encourages the algorithm to explore the
network
further away from the node in question. A low value encourages forming the
representation of the local network.
Conclusion
Table 2.1.13.1 shows Node2Vec’s highest Area Under Curve score
recorded.
This paper also compares Node2Vec to other embedding algorithms such as
LINE and
DeepWalk for link prediction task, with Node2Vec outperforming each of
those.
DeepWalk algorithm is considered to be a special case of
Node2vec
Given this information, we will be using Node2Vec for the embedding
process
The same authors have a GitHub repository with python 2 implementation of
Node2Vec [17]. However, to maintain uniformity in software versions, we will use
a
python 3 implementation of Node2Vec
[18]
Department of CSE Jan-May 2019 17 | Page
Impact Of User’s Relationship on User’s Interest Facebook Protein-Protein
Algorithm Dataset Interac
arXiv
Department of CSE Jan-May 2019 18 | Page
Spectral Clustering 0.612 0.4920 0.5740
DeepWalk 0.9680 0.7441 0.9340
LINE 0.9490 0.7249 0.8902
Node2Vec 0.9680 0.7719 0.9366
Table 2.1.13.1 Area Under Curve Scores for Link Prediction using the Hadamard operator. Node2Vec has the
highest scores
The following sections provide a summary on literature reviewed for Link prediction
approaches
2.1.14 Link Prediction in Heterogeneous Social Networks
[19]
Link Prediction is more complex in heterogeneous networks where there are
different
types of links. This paper takes a “multi-task metric learning” approach to link
prediction wherein each link type has a particular distance measure computed
based on
network and node features.
This paper describes the limitations of the current unsupervised and supervised
approaches as being unable to pick out relevant features from large noisy
datasets.
While the novel models proposed in this research are robust to noisy data, its
performance on large number of link types is not
known.
2.1.15 A Link Prediction Approach to Recommendations in Large-
Scale
User-Generated Content Systems
[20]
In user-generated content system, the amount of data is very large, and a K-
Nearest
Neighbours algorithm employed on a user item matrix will not scale well. The
experiments conducted in this paper show that considering the immediate
neighbourhood rather than the global network structure is better for performing
link
prediction. Using explicit networks for link prediction only marginally improves
performance.
Impact Of User’s Relationship on User’s Interest
The paper explores six link prediction
methods:
i. Adamic-Adar
ii. Common Neighbour
iii. Global Popularity
iv. Page Rank
v. Katz Measure
vi. Rooted Page Rank
Among all the six Link Prediction algorithms, the node-neighborhood based
algorithms
i.e. common neighbor and Adamic Adar outperformed the others. They also
outperformed the traditional collaborative filtering
method.
Department of CSE Jan-May 2019 19 | Page
Impact Of User’s Relationship on User’s Interest
CHAPTER 3
Methodology
3.1 Proposed Approach
a. We will essentially compare the effectiveness of user explicit network alone
vs.
implicit networks
b. We will also investigate the combination of the explicit and implicit networks
with user-business network
features
c. User being interested in a particular business will be modelled as a link
prediction task
d. Baseline model : with user explicit network embedding
alone
e. Non-baseline models:
▪ Implicit Type 1 network
▪ Implicit Type 2 network
▪ Combination of user explicit network and user-business bipartite
network
▪ Combination of implicit Type 1 network and user-business bipartite
network
▪ Combination of implicit Type 2 network and user-business bipartite
network
3.2 Improvement over Alternative Approaches
1. Recommendation systems rely on modeling user behaviour by providing
collaborative filtering [1] from user-item rating matrix alone and some social
recommenders take into consideration the user’s explicit network of
friends.
2. The problem with using only ratings is that specific attributes of a business
which
are liked or disliked by a user are not captured adequately. For example, a
user rated
a restaurant business as 2 because he dislikes the cuisine, and this rating
might deter
another user who loves the cuisine, from visiting the
business.
3. Current generation social recommenders rely on explicit social networks
but:
Department of CSE Jan-May 2019 20 | Page
Impact Of User’s Relationship on User’s Interest
o Links in explicit social network become stale with time
o Links in an explicit social network do not reflect behavior dimension
o Not every user has an explicit network of friends
4. Ourapproach using implicit networks based on user similarity, has more
relevant
information than explicit network, since users with similar interests are
grouped
together.
5. We will rely on network embedding that tries to translate network neighborhood
into a well-meaning vector
Department of CSE Jan-May 2019 21 | Page
Impact Of User’s Relationship on User’s Interest
3.3 High Level System Architecture
As stated earlier, there are four major modules in this
project:
1. Network Creation
2. Visualization
3. Node Embedding
4. Link Prediction
Figure 3.2.1 – Module dependencies. The downward arrow shows that module at a lower level is dependent
on the
module at the higher level. Modules at the same level are independent of each
other.
Module interactions
1. The module Network Creation is the base module, and all other modules are
dependent
on this modules’ results. Without implementation of this module, the others
cannot be
implemented.
2. The Visualization module uses data from the Network Creation
Module
3. The Visualization and Node Embedding modules are independent of each
other.
Department of CSE Jan-May 2019 22 | Page
Impact Of User’s Relationship on User’s Interest
4. The Link Prediction Module is dependent on the Node Embedding Module’s
results.
Without performing Node Embedding of the four networks, there will be no data
available to the Link prediction module.
Figure 3.2.2 – A high level architecture diagram showing dependencies among the various modules. The
network
creation task has been split into Yelp Dataset (source), User Explicit Network (a sub-module of network
creation
task) User Business Bipartite Network (another sub-module) and User Implicit Network (sub-module of
network
creation). The submodules User Explicit Network and User-Business Bipartite Network are shown on the
same level
implying that their implementation is independent of each
other.
Department of CSE Jan-May 2019 23 | Page
Impact Of User’s Relationship on User’s Interest
1. Yelp Data is present in a json format and this is the input to the
system
2. Next stage is the data processing step where we use this input to create the user-
explicit
network and the user-business bipartite
network
3. Next stage is another data processing stage that uses the output from the
previous two
stages. Here, the user implicit network is formed based on users obtained from
the user
explicit network, and similarity is calculated based on the reviews written by the
user
which is captured in the user-business bipartite
network
4. Next Stage is graph/network (or node) embedding. We apply this process to all
the three
networks that have been formed in the previous step. We will be using Node2Vec
for
this, as it has been shown to outperform( or perform as good as) other
embedding
algorithms
5. Once the networks have been embedded in their respective vectors, we use this
information for the link prediction task using machine learning
models
6. The output will be the comparison of performance among the three models
generated
Department of CSE Jan-May 2019 24 | Page
Impact Of User’s Relationship on User’s Interest
CHAPTER 4
Environment
Requirements
4.1 Hardware Requirements
The large size of our dataset and resulting amount of computation have resulted
in these
requirements for hardware
specifications:
• Current System Usage:
o Memory: 16GB RAM
o Intel i7 6th Gen processor
o No GPU
• Ideal System Usage:
o Memory: 32GB RAM
o Processors, number of CPUs: Intel i7 8th Gen
o GPU: Nvidia Tesla
4.2 Software Requirements
4.2.1 Anaconda
• Freeand open-source distribution of the Python and R programming
languages
for scientific computing and package management and deployment, used
to
handle jupyter notebook
• Version: 5.0.2
4.2.2 Jupyter Notebook
• Jupyter Notebook is a web-based interactive computational environment for
creating IPython documents, using the open-source IPython
library
Department of CSE Jan-May 2019 25 | Page
Impact Of User’s Relationship on User’s Interest
• Version: 5.7.0
4.2.3 Python
• Programming language used for all graph and network
computation
• Version: 3.5
• Python libraries used:
▪ pandas
▪ numpy
▪ pickle
▪ json
▪ networkx
▪ node2vec
▪ sklearn
▪ matplotlib
o Pip3 for installing any packages that are not provided by default in the
Anaconda environment
4.2.4 User Interface and backend
o JavaScript
o HTML5
o CSS3
o backend - Flask
Software such as anaconda, python can be run on multiple operating systems and
hence there
is no OS requirement.
In case of Node2Vec, parallelization of generating walks, is not supported on Windows,
but
this does not impact the accuracy of the code. On Linux based operating systems such
as
Ubuntu, where it is claimed that Node2Vec parallelization works, joblib cannot perform
task-
splitting for a very large dataset
size.
Department of CSE Jan-May 2019 26 | Page
Impact Of User’s Relationship on User’s Interest The
User Interface is supported on all latest
versions of browsers like Chrome, Edge, Mozilla
Firefox.
4.3 Data Requirements
The dataset we will be working on is a bipartite product-based social media
dataset that:
• Offers both explicit user and item attributes
• Offers implicit user attributes ( text data such as reviews)
• Offers both explicit and implicit network
• Facilitates forming suitable ML problem (link prediction) such as visiting a
restaurant
The dataset we have decided to use, is from the ongoing Yelp Dataset Challenge for
Graph
Mining and contains users, explicit friends and reviewing businesses (businesses such
as
restaurants, barber shops, automotive shops, etc). The Yelp platform allows a user to
review
such a business, and connect with other
users.
The Yelp Dataset consists of the following JSON
files:
1. user.json – list of users along with information about friends, user id, review
count,etc
2. business.json – list of business and multiple attributes such as business id,
average
reviews, review count, business
attributes
3. review.json – list of all the reviews posted by users for the
businesses
4. tips.json – comments from users about what’s a businesses’
specialty
Our project uses JSON files 1, 2 and
3.
4.3.1 User.json
a. Original dataset has 1.7 million users
b. Size of this dataset is 2.03 GB
c. The following features are present in the
dataset
i. User ID
ii. Name
iii. Review count
Department of CSE Jan-May 2019 27 | Page
Impact Of User’s Relationship on User’s Interest
iv. Yelping Since (when the user joined Yelp)
v. Friends –list of user ids who are friends with the
user
vi. Fans
vii. Elite – list of years in which a user was elite
viii. Average Stars
ix. Various compliments received from other users such as writer,
photos,
funny, cool
x. Number of useful, funny, cool votes sent by the
user
d. For filtering such a large network, review count above 100 was chosen as
the
criteria
e. For creating the user explicit network, only User ID and Friends features
were
used.
Figure 3.4.1- A sample user JSON from user.json
file
4.3.2 Business.json
f. The original dataset has 188593
businesses
g. Size of the dataset is 139 MB
h. The following features are present in the
dataset
i. Business ID
ii. Business Name
iii. Business Address with Postal Code, City and
State
Department of CSE Jan-May 2019 28 | Page
Impact Of User’s Relationship on User’s Interest
iv. Latitude and Longitude
v. Stars – averaged ratings across all reviews
received
vi. Review count – number of reviews for this
business
vii. Is open (0 – if closed, 1 if open)
viii. Attributes such as Restaurant Offering takeout, Business Parking
types
ix. Categories such as Mexican, Burgers, Italian,
etc
x. Hours – operating timings for restaurant on each day of the
week
i. This data is used to create user feature vectors and the given granularity of
the
data a lot of these attributes can be used to form the user
vector
Figure 3.4.2 – A sample business JSON from business.json
file
4.3.3 Review.json
j. The original data has 5996996(nearly 6 million) reviews
k. Size of the Dataset is 4.39 GB
l. The following features are present in the
dataset
i. Review ID
Department of CSE Jan-May 2019 29 | Page
Impact Of User’s Relationship on User’s Interest
ii. User ID
iii. Business ID
iv. Review Rating
v. Review text
vi. Date of the review
vii. Number of useful votes received
viii. Number of funny votes received
ix. Number of cool votes received
m. For creating the user business network, this review dataset was filtered
using
the user id as we wanted only those users with review count more than
100.
Figure 3.4.3 – A sample review JSON from review.json
file
Yelp Data is available at the Yelp website [21] or can be downloaded from Kaggle
[22].
Department of CSE Jan-May 2019 30 | Page
Impact Of User’s Relationship on User’s Interest
CHAPTER 5
Demonstration of
Outcome
1. Analysis and visualization of the three
networks
a. We will provide information such as clustering coefficient, type of network
for
all the three networks
b. We will visualize the network using networkx graph visualization
methods
2. Tabular column with node2vec parameters used to embed the four
networks
a. Number of walks
b. Length of the walk
c. P
d. Q
e. Dimension of output vector
3. Tabular format for result comparison at link prediction task, reporting the
following:
a. Test accuracy
b. Precision
c. Recall
d. F1-Score
4. User Interface
a. We will create a simplistic UI that allows a user to
i. view all the businesses
ii. click on multiple businesses, say to review them
iii. select friends
iv. see recommended business suggested by the six
models
b. The figure below shows what our UI offers to an end-
user
Department of CSE Jan-May 2019 31 | Page
Impact Of User’s Relationship on User’s Interest
Figure 5.1 – Use case diagram of the User
Interface
Department of CSE Jan-May 2019 32 | Page
Impact Of User’s Relationship on User’s Interest
CHAPTER 6
Proposed
Approach
6.1 Creating User Explicit network of friends
a. Yelp user.json dataset contains 1.5 Million users which is too large to be
processed by
a typical machine (hardware
constraint)
b. Hence the users are first filtered based on the number of reviews they have
published
(considered to be 100 and
above)
c. This is based on the assumption that eliminating users who have published very
few
reviews will not cause any information loss. Essentially, we look at a subset of
the
possible graph containing users that publish many
reviews.
6.2 Creating User Business bipartite network
a. Again, we consider only the those users’ reviews who have published more than
100
reviews (dependency on user explicit network
data)
b. Each review present in the business.json after filtering, is represented as an edge
between that particular user and business
node
6.3 Creating Implicit Network Type 1
The implicit network type 1 is based on similarity of users visiting the same
business
The following algorithm was used to create the
network:
a. For each user pair U1, U2 in filtered
user.json
a. Get set S1 of businesses reviewed by U1 from user business
network
b. Get list S2 of businesses reviewed by U1 from user business
network
c. Compute Jaccard similarity J for (U1, U2) as follows:
i. J = | S1 ∩ S2 | |S1
U S2|
d. If J >= 0.5, create an edge between U1 and
U2
Department of CSE Jan-May 2019 33 | Page
6.4
Impact Of User’s Relationship on User’s Interest
Creating Implicit Network Type 2
The implicit type 2 network is based on similarity of users who visit similar
restaurants
Here, business attributes such as cuisines are considered. For example, users who visit
Italian
restaurants are considered similar
users..
A user feature vector is created which consists of the number of businesses visited by
the user
categorized by cuisines. For example, if a user has visited 3 businesses that serve
Italian cuisine,
2 with American cuisine, 5 with Mexican cuisine, then the user vector would look
like
[3 , 2 , 5]
Note that a business that offers multiple cuisines will be counted multiple times. For
example
a restaurant that offers Italian and American cuisines but not Mexican would have a
vector like:
[1 , 1 , 0]
This business vector is added to the user
vector.
6.5 Node Embedding
Embedding was performed for each of the four networks. The Node2Vec parameters
used are
presented as follows:
• For all four networks in-out parameter was chosen to be
1
• For all four networks the return parameter was chosen to be
1
• For all four networks, the walk length was chosen to be
30
• The differences in parameters are summarized
below
Network Output Dimension Number of walks
User explicit 64 200
User business 32 100
Implicit type 1 64 200
Implicit Type 2 32 100
Table 6.5.1 - Node2Vec Parameters for the four networks
Department of CSE Jan-May 2019 34 | Page
Impact Of User’s Relationship on User’s Interest
6.5.1 Constraints for Embedding
The reason for using fewer walks and a lower output dimension for the user
business
was to not increase dataset dimensionality(more columns) compared to dataset
size(number of rows) when combining it with other network
embedding.
As the implicit type 2 network was much larger than implicit type 1, the dimension
of
embedding vector was kept at 32 in order to reduce RAM consumption when
running
the code for Node2Vec.
While Node2Vec offers a way to load big graphs that may not fit into memory, it
also
slows down the processing of the graph. For example, processing the implicit
type 2
network would take roughly 600 hours when this approach was used. Owing to
very
large processing time, this method was was not
used.
For the user business network and implicit network type 2, which were extremely
large,
the edgelist file (text file containing the list of edges in a network) was restricted
to
6MB in order to run the embedding process with a 15.6GB RAM without running
out
of memory.
The embedding of the nodes in each of these networks were used for link
prediction
task.
6.6 The Link Prediction Problem
a. We will essentially compare the effectiveness of user explicit network alone vs.
implicit
networks
b. We will also investigate the combination of the explicit and implicit networks with
user-
business network features
c. User being interested in a particular business will be modelled as a link prediction
task
d. Baseline model : with user explicit network embedding
alone
e. Non-baseline models:
a. Implicit Type 1 network
b. Implicit Type 2 network
c. Combination of user explicit network and user-business bipartite
network
Department of CSE Jan-May 2019 35 | Page
Impact Of User’s Relationship on User’s Interest
d. Combination of implicit Type 1 network and user-business bipartite
network
e. Combination of implicit Type 2 network and user-business bipartite
network
f. Link prediction task is modelled as a binary classification
problem:
a. Prediction of class 0 implies no edge formed between user and
business
b. Prediction of class 1 implies edge formation between user and
business
g. The presence of an edge between user and business is a positive instance while
the
absence is a negative instance. As both are needed in order to train a model, a
residual
graph of the user-business network was created, and then randomly sampled to
get
required number of edges of the residual graph to roughly balance the number of
positive and negative instances for training and
testing.
h. Train and test data had the following attributes used for
prediction:
a. Embedding for user from explicit network (64 dimension) + business
id
b. Embedding for user from implicit network Type 1 (64 dimension) +
business id
c. Embedding for user from implicit network Type 2 (32 dimension) + business
id
d. Embedding for user from explicit network + Embedding for business from
user
business network (32
dimensions)
e. Embedding for user from implicit network Type 1 + Embedding for business
from user business network
f. Embedding for user from implicit network Type 2 + Embedding for business
from user business network
g. Each instance in the train and test had the node embedding as described
above.
h. All the six combinations were tested for their accuracy on the test
data.
6.7 Demo Approach
Step 1 – Front-end functions:
1. Display a list of users
2. Allow new user to select at least 5 friends
3. Display a list of businesses
4. Allow new user to select at least 5 businesses
5. Once the above conditions are satisfied perfrom steps 6 and 7
Department of CSE Jan-May 2019 36 | Page
Impact Of User’s Relationship on User’s Interest 6. Create a list
of user_id selected by the new user
7. Create a list of business_id selected by the new user
8. Send lists in 6 and 7 to the backend
Step 2- Back end Functions:
1. We have to create 4 networks for the New User
a. User Explicit
b. User Implicit – Type 1
c. User Implicit – Type 2
d. User Business
2. We have to perform embedding for each of the four networks
3. Create Datasets ( 6 different types)
4. Use the saved SVM models to predict ‘Link Column’
5. Whichever Links are predicted as 1, will be recommended to the user
6. Backend will send list of recommended business_id for each of the 6 models
Step 3- Front-end Functions:
1. Create 6 separate div tags for showing the recommended businesses for each
model
2. Display business cards (as done in step 1)
3. Logical view of the recommendations page
Department of CSE Jan-May 2019 37 | Page
Impact Of User’s Relationship on User’s Interest
Department of CSE Jan-May 2019 38 | Page
Figure 6.7.1-A sample view of the UI for recommendations page
Impact Of User’s Relationship on User’s Interest
CHAPTER 7
Detailed Design
7.1 Creating the networks
1. Since the number of users was too large for processing, the users were
filtered
based on the review count (above
100)
2. This filtering resulted in 67000 users
3. After filtering the users, the reviews.json was also
filtered.
4. The reviews written by the same 67000 users were
considered
5. This resulted in a 1.35 GB review file
(filtered_reviews.json)
6. This filtered review file was used to create the user-business
network
7. The business.json file was relatively small and did not require any
filtering
8. For each of the 67000 users, the friends list was extracted and the user
explicit
network was created
9. Using the user-business network, each user pair was taken, and the
jaccard
similarity was computed as follows: J = |A |A
∩ B| ∪ B|
Where A is the set of businesses visited by one user, and B is the set of
businesses
visited by the second user
A ∩ B represents the set of businesses reviewed by
both users
A ∪ B represents the set of all businesses viewed by either one or
both of the users
10. Similarity threshold of 0.5 was used to preserve an edge between two users
i.e. all
user pair with similarity below 0.5 were discarded. This resulted in the
implicit
type 1 network
11. The implicit type 1 network connects users that have reviewed the
same
businesses
12. For the implicit type 2 network, user-business network and the
business.json file
were used to create the user feature vector which was initially
computed
considering the following attributes:
Department of CSE Jan-May 2019 39 | Page
Impact Of User’s Relationship on User’s Interest
i. bar
ii. burgers
iii. sandwiches
iv. thai
v. mexican
vi. italian
vii. american
viii. coffee
ix. tea
x. chinese
xi. asian
xii. barbeque
xiii. french
xiv. indian
xv. pakistani
xvi. middle_eastern
xvii. japanese
xviii. shopping
xix. fashion
xx. flowers
xxi. gifts
xxii. restaurants
xxiii. dessert
xxiv. breakfast
xxv. lunch
xxvi. dinner
xxvii. brunch
xxviii. RestaurantsTakeOut
xxix. BusinessAcceptsCreditCards
xxx. RestaurantsReservations
Department of CSE Jan-May 2019 40 | Page
Impact Of User’s Relationship on User’s Interest
Department of CSE Jan-May 2019 41 | Page
13. Since most of the features used were applicable to restaurant business, we
dropped
features like fashion, flowers, gifts. The restaurants feature was redundant
as we
considered only restaurant
businesses.
14. Upon further inspection, it was observed that attributes like whether a
restaurant
was good for desserts, lunch, breakfast, etc, were extremely sparse as
multiple
restaurants did not have such information. Hence these features (xxiii to
xxvii)
were also dropped.
Figure 7.1.1 A snapshot of the user feature vectors file
15. For each user pair, the cosine similarity was computed as
follows:
U1.U2
cosine(U1,U2) =
√∑ U1i2 ni ∗ √∑ U2i2 ni
where i is each element in the respective vector and n the is vector length.
Note that both vectors must be of the same
size.
16. As there were a very large number of similar users, a similarity threshold of 1
was
used to preserve an edge between a pair of
users.
17. The implicit type 2 network connects users that visit similar types of
businesses
Impact Of User’s Relationship on User’s Interest
Figure 7.1.2 - High Level Flowchart for network creation
7.2 Performing Node2Vec
We initially tried Node2Vec with this data on a 16GB RAM Intel i7 processor, but
the
size proved to be too large.
While Node2Vec allows for parallelization of generating random walks, it does
not
work on Windows OS and hence we had to run it serially, making it a very time
consuming process (over 2 days for
processing)
We tried running Node2Vec on a native Linux (Ubuntu) machine with
parallelization,
but the sub-tasks were still too large to be handled and the process
failed.
Department of CSE Jan-May 2019 42 | Page
Impact Of User’s Relationship on User’s Interest
Department of CSE Jan-May 2019 43 | Page
Node2Vec was run on Google Cloud with a Linux instance, however due to some
issues, the runtime would disconnect from the cloud instance and the process
would
stall.
Since Cloud provided limited credits to run the process for free, and more
powerful
instances were very expensive, this approach of running code was
abandoned
Hence we were forced to randomly remove the edges and use the remaining
edges for
performing node embedding
It was observed that files over 6MB would cause out-of-memory errors when
using
Node2Vec, on the 16GB
Thus, the implicit network type 2 and user business network were filtered by
randomly
sampling the edges.
On performing Node2Vec, embedding for the nodes in the respective networks
were
retrieved.
Figure 7.2.1 High Level Flowchart for Network Embedding using Node2Vec
Impact Of User’s Relationship on User’s Interest
7.3 Link prediction
As stated earlier, 6 models were created and compared for accuracy, precision, recall
and F1-
Score
In each dataset the negative instances (user and business where link does not exist)
were
extracted from the residual graph of the original user-business
network
7.3.1 Model 1: User Explicit Network
The following features were used:
i. Each element of the 64-dimensions embedding vector of a node in the
explicit
network was considered a
feature
ii. Business id.
The Support Vector machine model was trained on a total of 65 features, predicting 0 or
1 class
for the Link
7.3.2 Model 2 : User Implicit Network – Type 1
The following features were used:
i. Each element of the 64-dimensions embedding vector of a node in the
implicit
type 1 network was considered a
feature
ii. Business id.
The Support Vector machine model was trained on a total of 65 features, predicting 0 or
1 class
for the Link
7.3.3 Model 3 : User Implicit Network – Type 2
The following features were used:
i. Each element of the 32-dimensions embedding vector of a node in the
implicit
type 2 network was considered a
feature
ii. Business id.
The Support Vector machine model was trained on a total of 33 features, predicting 0 or
1 class
for the Link
Department of CSE Jan-May 2019 44 | Page
Impact Of User’s Relationship on User’s Interest
7.3.4 Model 4 : User Explicit Network + User Business
Network
The following features were used:
i. Each element of the 64-dimensions embedding vector of a user node in the
explicit network was considered a
feature
ii. Each element of the 32-dimensions embedding vector of a business node in
the
user business network was considered a
feature
The Support Vector machine model was trained on a total of 96 features, predicting 0 or
1 class
for the Link
7.3.5 Model 5 : User Implicit Network – Type 1 + User Business
Network
The following features were used:
i. Each element of the 64-dimensions embedding vector of a user node in the
implicit network – Type 1 was considered a
feature
ii. Each element of the 32-dimensions embedding vector of a business node in
the
user business network was considered a
feature
The Support Vector machine model was trained on a total of 96 features, predicting 0 or
1 class
for the Link
7.3.6 Model 6 : User Implicit Network – Type 1 + User Business
Network
The following features were used:
i. Each element of the 32-dimensions embedding vector of a user node in the
implicit network – Type 1 was considered a
feature
ii. Each element of the 32-dimensions embedding vector of a business node in
the
user business network was considered a
feature
The Support Vector machine model was trained on a total of 64 features, predicting 0 or
1 class
for the Link
Department of CSE Jan-May 2019 45 | Page
Impact Of User’s Relationship on User’s Interest
Department of CSE Jan-May 2019 46 | Page
Figure 7.3.1- High Level Flowchart for Link Prediction
Impact Of User’s Relationship on User’s Interest
CHAPTER 8
Results
8.1 Visualizations of networks
The user explicit network and the user implicit networks (both Type 1 and Type
2) were
visualized in order to compare their network
structure.
To achieve this, we present the visualization of the entire network, the degree
histogram, the Logarithmic graph of degree count. Network parameters such as
the
number of nodes and edges, network density, maximum degree and global
clustering
co-efficients have also been
mentioned.
8.1.1 User explicit network of friends
Figure 8.1.1.1 Degree histogram of the user explicit network
Department of CSE Jan-May 2019 47 | Page
Impact Of User’s Relationship on User’s Interest
Figure 8.1.1.2- Logarithmic plot for degree and count for explicit network
Figure 8.1.1.3- Visual for the entire user explicit network
Department of CSE Jan-May 2019 48 | Page
Impact Of User’s Relationship on User’s Interest 8.1.2
User implicit network type 1
Figure 8.1.2.1- Degree histogram of the user implicit network type 1
The maximum degree here is a little over 200.
Figure 8.1.2.2- Logarithmic plot for degree and count for implicit network type 1
Department of CSE Jan-May 2019 49 | Page
Impact Of User’s Relationship on User’s Interest
Figure 8.1.2.3- Visual for the entire user implicit network type 1
8.1.3 User implicit network type 2
Figure 8.1.3.1- Degree histogram of the user implicit network type 2
Maximum degree here is over 4000.
Department of CSE Jan-May 2019 50 | Page
Impact Of User’s Relationship on User’s Interest
Figure 8.1.3.2 - Logarithmic plot for degree and count for implicit network type 2
Figure 8.1.3.3 Visual for the entire user implicit network type 2
8.1.4 Comparison of the three networks
It is expected to see relatively higher clustering in the implicit networks as users are
grouped
based on similarity.
Network density is poor across all, as only selective edges are
present.
Using Type 2 implicit network, there are over 7 million edges compared to 1 million
edges in
the explicit network and 200,000 edges in implicit Type 1. The high number of edges
can be
attributed to relaxing the similarity constraint – rather than considering exact businesses
visited,
similar businesses are
considered.
Department of CSE Jan-May 2019 51 | Page
Impact Of User’s Relationship on User’s Interest It
is also observed that the Degree-Count
logarithmic plot is not a straight line with positive
slope. This implies that the networks are not real-world
networks.
Parameter User explicit User implicit
User implicit network User implicit
Network of friends network type 1
type 2 network type 1
61725 18231 22668
Number of nodes
1912069 211493 7673278
Number of edges
<1000 ~200 >4000
Maximum Degree
0.00100 0.00127 0.02987
Network Density
Global clustering 0.56169
0.79355
co-efficient
Department of CSE Jan-May 2019 52 | Page
Table 8.1.4.1 – Comparison of Network parameters for the three networks
8.2 Link prediction results
Our best performing model was combination 6 – user implicit network type 2 with
business
node embedding giving us an accuracy of 89% with F1-score for class 1 as 0.901
and F1-
score for class 0 as 0.895.
The Accuracy, Recall, Precision and F1-Scores have been presented for both
classes 0 (no
link) and 1 (link is present)
Impact Of User’s Relationship on User’s Interest (c
(class 1)
Data Combination Accuracy Recall
(class 1)
Precision
(class 1)
F1-Score
F1-Score
User Explicit 54.7% 0.938 0.161 0.275
User Implicit - Type 1 76.6% 0.984 0.541 0.698
User Implicit - Type 2 75.1% 0.553 0.915 0.690
User Explicit with business embedding User Implicit - Type 1 with business
50% Not defined 0 Not
embedding
(model does not predict 1s) 79.1% 0.789 0.996 0.880
defined
Department of CSE Jan-May 2019 53 | Page
User Implicit - Type 2 with business embedding
89.8% 0.924 0.879 0.901
Table 8.2.1- Model Accuracy and Recall, Precision F1-Score for class 1
Data Combination Accuracy Recall (class Precision
Precision
F1-Score (class 0)
(class 0)
(class 0)
User Explicit 54.7% 0.507 0.987 0.670
User Implicit - Type 1 76.6% 0.683 0.991 0.809
User Implicit - Type 2 75.1% 0.948 0.680 0.792
User Explicit with business embedding 50% 0.5 1 0.667
User Implicit - Type 1 with business User Implicit - Type 2 with business
embedding embedding
79.1% 0.884 0.085 0.156 89.8% 0.872 0.920 0.895
Table 8.2.2- Model Accuracy and Recall, Precision F1-Score for class 0
Impact Of User’s Relationship on User’s Interest It
is also likely that dimensionality plays a role
here. User explicit network with the business
network embedding and User Implicit - Type 1 with business embedding had 96
(64+32)
features with which to perform link prediction. Given the relatively smaller size of
the
datasets, and high dimensionality, the accuracy might be reduced. User Implicit -
Type 2
network with business embedding had 64 (32+32) features with which to perform
link
prediction. Since the number of features here is lesser, there is a tendency towards
better
accuracy.
Except for the user explicit network, adding the business node embedding from the
user-
business bipartite graph has impacted accuracy in a positive manner. This shows
that user-
business network combined with implicit network has a very good
performance.
Department of CSE Jan-May 2019 54 | Page
Impact Of User’s Relationship on User’s Interest
CHAPTER 9
Conclusions
Social Networks contain very useful information about users that may be used to
predict their
interests. The ties between two nodes represent a relationship between the two
nodes. This
relationship may be explicit or implicit.
Implicit networks have performed better than explicit network with similar
configurations.
Using the implicit network based on cuisines (Type 2) along with the business
embedding
gives the best results in terms of
accuracy.
Using a combination of networks rather than a single network alone, improves link
prediction
accuracy. This is because of the fact that both node types i.e.; user and business
should be
represented as embedding vectors for the model to perform the link prediction
between the
two.
The dataset dimensionality should be in proportion to the number of instances. Very
high
dimensions may lead to redundancy and may impact the accuracy of the machine
learning
model.
Department of CSE Jan-May 2019 55 | Page
Impact Of User’s Relationship on User’s Interest
CHAPTER 10
Future Work
Creating the user business graph can be more complex than what we have considered,
by using
the ratings as edge weights. Rather than just performing a binary classification for
predicting
links between user and business (0 and 1), we can predict the ratings for that link on a
scale of
1-5.
Rather than considering only cuisines from the set of business attributes, for creating
implicit
network, additional information can be gathered by performing text mining on the
reviews
written by users. This can be used to personalize recommendations for the user, by
extracting
what the user particularly liked and disliked about the
business.
While the current work limited to restaurants, other types of businesses can be
recommended
based on business specific attributes. As an example financial services such as banks
can be
analyzed based on type of customer service, reputation, ease of getting
loans, etc.
Department of CSE Jan-May 2019 56 | Page
Impact Of User’s Relationship on User’s Interest
CHAPTER 11
References
[1] Empirical Analysis of Predictive Algorithms for Collaborative
Filtering
[2] Suggesting (More) Friends Using the Implicit Social Graph by Maayan Roth,
Tzvika
Barenholz, Assaf Ben-David, David Deutscher, Guy Flysher, Avinatan Hassidim, Ilan
Horn,
Ari Leichtberg, Naty Leiser, Yossi Matias, Ron
Merom
[3] Learning to Recommend with Explicit and Implicit Social Relations by HAO MA, IRWIN
KING,
and MICHAEL R. LYU
[4] Node2Vec: Scalable Feature Learning for Networks by Aditya Grover, Jure
Leskovec
[5] LINE: Large-scale Information Network Embedding by Jian Tang, Meng Qu, Mingzhe
Wang,
Ming Zhang, Jun Yan, Qiaozhu Mei
[6] DeepWalk: online learning of social representations by Bryan Perozzi, Rami Al-Rfou,
Steven
Skiena
[7] Analyzing Implicit Social Networks in Multiplayer Online Games by Alexandru Iosup, Ruud
van
de Bovenkamp, Siqi Shen, Adele Lu Jia, Fernando
Kuipers
[8] Identifying Implicit Relationships between Social Media Users to Support Social
Commerce by
Christopher C. Yang, Xuning Tang, Haodong Yang, Ling
Jiang
[9] Measuring Tie Strength in Implicit Social Networks by Mangesh Gupte, Tina Eliassi-
Rad
[10] Search Lens : composing and capturing complex user interests for exploratory search
by Chee
Chang, Joseph & Hahn, Nathan & Perer, Adam & Kittur,
Aniket.
[11] Graph Embeddings for Social Network Analysis: State of the Art by Paul Compagnon,
Kilian
Ollivier
[12] Graph Representation Learning and Graph Classification by Sara
Riazi
[13] DeepInf: Modeling Influence Locality in Large Social Networks by Jiezhong Qiu, Jian
Tang,
Hao Ma, Yuxiao Dong, Kuansan Wang, and Jie
Tang
[14] Attributed Social Network Embedding by Lizi Liao, Xiangnan He, Hanwang Zhang, and
Tat-
Seng Chua
Department of CSE Jan-May 2019 57 | Page
Impact Of User’s Relationship on User’s Interest
[15] Network Representation Learning with Rich Text Information by Cheng Yang, Zhiyuan Liu,
Deli
Zhao, Maosong Sun, Edward Y.
Chang
[16] Heterogeneous Network Embedding via Deep Architectures by Shiyu Chang, Wei Han,
Jiliang
Tang, Guo-Jun Qi, Charu C. Aggarwal, Thomas S.
Huang
[17] https://github.com/aditya-
grover/node2vec
[18] https://github.com/eliorc/node2vec
[19] Link Prediction in Heterogeneous Social Networks by Sumit Negi, Santanu
Chaudhury
[20] A Link Prediction Approach to Recommendations in Large-Scale User-Generated
Content
Systems by Nitin Chiluka, Nazareno Andrade, and Johan
Pouwelse
[21] https://www.yelp.com/dataset
[22] https://www.kaggle.com/yelp-dataset/yelp-
dataset
Department of CSE Jan-May 2019 58 | Page
Impact Of User’s Relationship on User’s Interest
CHAPTER 12
Appendices
12.1 Implicit Network Type 2 after randomly sampling the
edges
to reduce the size
Number of nodes: 4017
Number of Edges: 258843
Network Density : 0.03209008625691409
Global clustering co-efficient :
0.793551512149485
Figure 12.1.1- Degree histogram of the subgraph of user implicit network type 2 after random sampling of
edges from original
Department of CSE Jan-May 2019 59 | Page
Impact Of User’s Relationship on User’s Interest
Figure 12.1.2 - Logarithmic plot for degree and count for subgraph of implicit network type 2
Figure 12.1.3 Visual for the subgraph of user implicit network type 2
12.2 Visualization code
Reading a graph from an edgelist file
Number of nodes and
edges
Department of CSE Jan-May 2019 60 | Page
Impact Of User’s Relationship on User’s Interest
Network Density
Degree distribution histogram
Degree-count logarithmic plot
Global clustering co-efficient
Department of CSE Jan-May 2019 61 | Page
Impact Of User’s Relationship on User’s Interest
Graph Visualization
12.3 Network creation code
Figure 12.3.1- Creating the user explicit network of friends
Department of CSE Jan-May 2019 62 | Page
Impact Of User’s Relationship on User’s Interest
Department of CSE Jan-May 2019 63 | Page
Figure 12.3.2- Creating the user-business review network
Impact Of User’s Relationship on User’s Interest
Department of CSE Jan-May 2019 64 | Page
Figure 12.3.3- Creating the implicit network type 1
Impact Of User’s Relationship on User’s Interest
Department of CSE Jan-May 2019 65 | Page
Figure 12.3.4- Creating the implicit network type 2
Impact Of User’s Relationship on User’s Interest
12.4 Node embedding code
Figure 12.4.1- Performing Node Embedding using Node2vec
12.5 Link prediction code
Figure 12.5.1- Creating the dataset by adding embedding vector
Department of CSE Jan-May 2019 66 | Page
Impact Of User’s Relationship on User’s Interest
Department of CSE Jan-May 2019 67 | Page
Figure 12.5.2- Training SVM Model
Figure 12.5.3- Testing the SVM model and computing accuracy
Impact Of User’s Relationship on User’s Interest
Figure 12.5.4- Computing Recall, Precision, F1-score for each class k
UI snapshots
Figure 12.6.1 UI Snapshot 1 - Business Cards selected by user, are highlighted in Dark blue. The number in
top right corner indicates number of businesses selected
Department of CSE Jan-May 2019 68 | Page