Community Aware Random Walk for Network Embedding
Mohammad Mehdi Keikha1,2, Maseud Rahgozar1, Masoud Asadpour1
1 School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
2 University of Sistan and Baluchestan, Zahedan, Iran
Email: {mehdi.keikha, rahgozar, asadpour}@ut.ac.ir
Corresponding Author: Maseud Rahgozar
Abstract:
Social network analysis provides meaningful information about the behavior of network members that can be used for diverse applications such as classification and link prediction. However, network analysis is computationally expensive because features must be learned for each application. In recent years, much research has focused on feature learning methods for social networks. Network embedding represents the network in a lower dimensional space while preserving its properties, thereby providing a compressed representation of the network. In this paper, we introduce a novel network embedding algorithm named “CARE” that can be applied to different types of networks, including weighted, directed and complex ones. Current methods try to preserve the local neighborhood information of nodes, whereas the proposed method utilizes both the local neighborhood and the community information of network nodes to cover both the local and the global structure of social networks. CARE builds customized paths, which consist of local and global structural information of network nodes, as a basis for network embedding and uses the Skip-gram model to learn the representation vectors of nodes. Subsequently, stochastic gradient descent is applied to optimize our objective function and learn the final representation of nodes. Our method scales to new nodes appended to the network without information loss. Parallel generation of the customized random walks is also used to speed up CARE.
We evaluate the performance of CARE on multi-label classification and link prediction tasks. Experimental results on various networks indicate that the proposed method outperforms the others in both Micro and Macro-f1 measures for different sizes of training data.
Keywords:
Representation learning, Network embedding, Community detection, Skip-gram model, Link prediction.
1. Introduction:
There has been remarkable growth in online social networks and the number of their users. Valuable
information can be extracted from social networks by analyzing both their structure and content. Machine
learning techniques are used as a way to extract valuable features from social networks for different
analysis tasks such as classification [1, 2, 3], recommendation [4, 5] and link prediction [6, 7, 8, 9]. These
learning methods can be either supervised or unsupervised. Supervised learning algorithms can extract better features for a specific task on social networks, but their scalability is challenging for large networks. On the other hand, unsupervised methods can handle the scalability of feature learning; however, the extracted features show lower accuracy on different network analysis tasks, as they are too general to give valuable information for a specific task [10, 11, 12, 13, 14, 15, 16].
Network embedding, as an unsupervised representation learning task, tries to extract an informative lower dimensional representation of network nodes. It learns the social relationships of network nodes in a low dimensional space so as to preserve both the microscopic and the macroscopic network structure, including various proximity orders, community membership and their inherent properties. These representation vectors can be used in different social network analysis tasks such as classification [17], recommendation [18] and link prediction [6]. Some classic network embedding methods use the eigenvectors of an affinity graph as feature vectors [10, 15, 19, 20]. Graph factorization is another technique used for network embedding [21]. The aforementioned approaches suffer from scalability issues on large social networks.
In recent years, deep learning has been widely used as an unsupervised method in natural language processing; a detailed description of this research can be found in [11]. Many studies have also used deep learning for social network embedding [22, 23, 24, 25]. Network embedding methods try to represent graph nodes with informative feature vectors. DeepWalk [22], LINE [23] and Node2vec [24] are the most important methods proposed in recent years. Although these methods perform well in comparison to other graph representation methods such as Spectral clustering, they extract only local structural information for each node and then employ it to learn the node’s final representation. Communities, however, are important structural information ignored by these methods [26].
Community structure imposes constraints on the nodes’ representation at a higher structural level. The representations of nodes within a community should be more similar than those of nodes belonging to different communities. Furthermore, for two nodes within a community, even if they have only a weak relationship in the local structure due to data sparsity, their similarity is strengthened by the community structure constraint. Thus, incorporating community structure in network embedding can provide effective and rich information to mitigate data sparsity issues in global structures and, moreover, make the learned node representations more discriminative [25].
In this paper, we propose a new network embedding method called “CARE,” which utilizes the community information of network nodes to capture more structural information of networks. Some previous studies have tried to embed community information in node representations. For instance, Grover et al. [24] only consider the community members whose distance to the source node is less than 2. However, in real-world networks, where communities have thousands of members, Node2vec cannot consider information about nodes whose distance from the source of the random walk is more than two, because it creates second order random walks. CARE can also produce representation vectors of nodes for arbitrary types of networks, such as weighted, complex and directed ones.
CARE first extracts the communities of the input network. We obtain this information with the Louvain method [27], which performs effectively on different social networks. To learn the final representations, we generate community aware random walks that consider both first and higher order proximities as well as the community membership information of each node. The customized paths contain nodes that are in the same neighborhood structure as well as nodes that belong to the same community. CARE creates several customized paths for each network node to embed different structural information into the final representation. Finally, the customized random walks are used as contextual information to learn the final representation of nodes with the Skip-gram learning model.
CARE is evaluated with two social network analysis tasks: multi-label classification and link prediction. The experimental results show that CARE outperforms Node2vec with a gain of 50% in multi-label classification on the BlogCatalog dataset and of 3% in link prediction on the PPI dataset.
To summarize, we make the following contributions:
- We present a novel network embedding algorithm named CARE that learns the representation of nodes for different types of networks, such as weighted, directed and complex networks.
- Our method can preserve the community information of the network in the learned representation vectors, while previous studies are not able to define an optimization function that considers this information explicitly.
- CARE preserves all properties of the network structure through the generation of customized paths for each node independently. Therefore, it spends less time learning the final representations of nodes thanks to parallel path generation.
- We empirically evaluate the algorithm on multi-label classification and link prediction problems with different real-world social networks. The experimental results indicate the efficiency of CARE in contrast to other network embedding methods.
The rest of the paper is organized as follows. In section 2, we summarize work related to network embedding. We explain the details of CARE in section 3. Section 4 outlines the experimental results on two network analysis tasks. Finally, section 5 presents conclusions and future work.
2. Related Works:
In this section, we review recent research related to unsupervised representation learning of network nodes. Some feature learning approaches use the adjacency matrix of the network and try to preserve the first order proximity of nodes. These works act as dimension reduction methods and find the best eigenvectors of network matrices [10, 15, 16, 19, 20, 21, 28] to use as the feature vectors of networks. Eigenvector decomposition is usually computationally expensive. Furthermore, these methods only consider the immediate neighborhood of nodes and do not use higher order proximities or community information, so they are unable to preserve the global structure of networks. As a result, the learned representations do not provide adequate performance on diverse network analysis tasks.
In recent years, deep learning has been used as an alternative way to learn feature vectors of network nodes. These methods generate random walks with different graph exploration strategies and feed them as contextual information into the Skip-gram model. DeepWalk was the first method that used the Skip-gram model [22]. It applies a DFS-like search strategy to generate random walks. Despite its good performance on multi-label classification, this method fails to preserve the global network structure because it does not consider the community information of network nodes. LINE uses first and second order proximities to learn node representations, but it also preserves only local information of the network [23]. The authors in [23] define two independent functions for the first and second order proximities but ignore community information. LINE and DeepWalk also fail to learn representation vectors for network edges.
Node2vec creates random walks based on DFS- and BFS-like strategies [24]. While Node2vec uses two control parameters to consider both the homophily [29] and the structural equivalence [30] of networks, it does not guarantee reaching the different nodes of a community. The main reason is that it only considers second order proximities and cannot reach nodes whose distance from the start node of the random walk is more than 2. Since real networks contain many nodes per community, and their distances are often greater than two, Node2vec does not consider all community members when creating random walks for a node. SDNE proposes a semi-supervised deep model with multiple layers of non-linear functions, thereby being able to capture the highly non-linear network structure [31]. It exploits the first and second order proximities jointly to preserve the network structure, but it does not use community information.
The method proposed in [25] uses modularized non-negative matrix factorization to preserve both the microscopic and the macroscopic information of networks. The authors of [25] define two independent models to embed local and community information separately and then optimize a joint function to learn the node representations. Because the local and community structures are learned separately and the final representations are then combined, the result is not general enough to be used in different network analysis tasks. It also loses some local structural information because it combines the first and second order proximities in a unified matrix; this unification leads to missing information about the different proximities during representation learning. Their method also suffers from scalability issues on large networks, because many parameters must be learned to preserve the local and global structures, so it is not applicable to real social networks.
Unlike previous studies, we employ a mixture of BFS- and DFS-like strategies alongside the community information of network nodes, without any restriction on the search length over the search space. We preserve both local and global information because we use first and higher order proximities as well as the community information of nodes to learn the node representations.
3. CARE: Community Aware Random Walk for Network Embedding
Community information is one of the key features of social networks and preserves the global structure of the network [26]. However, it is ignored by most previous research on network embedding when gathering information about network nodes. We present a new algorithm to embed the graph structure alongside community information into the learned representation vectors of network nodes. Therefore, we redefine network embedding as a maximum likelihood problem informed by the global network structure. Suppose G = (V, E) is an (un)directed graph, where V and E are the sets of graph nodes and edges. We aim to find a mapping function f: V → R^d, where d is the representation size of each graph node. To obtain the best mapping function f, the Skip-gram model is used [32, 33].
In CARE, the neighborhood structure of each node is first extracted from the given network using the community aware random walk strategy. Subsequently, by using the Skip-gram model, the representation vector of the node is learned from these generated random walks. Most previous approaches model the neighborhood structure of a node using only the first and second order proximities. In contrast, we also use nodes that may have no immediate connection or second order proximity with the source node but have a homophily relationship that is not captured by the first and second order proximities.
Once the different neighborhood structures are extracted for all nodes, we use the Skip-gram model, similarly to [24], to maximize the likelihood of N(u), where N(u) contains the neighborhood structure of a node u. The Skip-gram model learns the best representation vector for the node u based on the structural information contained in N(u). In the following, we explain how we create the neighborhood structures and how they are used to learn the social representations of a node in the given network. Algorithm 1 illustrates the steps of the CARE algorithm.
Algorithm 1: CARE (G, 𝑤, d, µ, Ɩ)
Input:
    Graph G(V, E)
    Window length 𝑤
    Representation size d
    Number of random walks per node µ
    Random walk max length Ɩ
Output:
    Matrix of node representations f ∈ R^(|V| × d)
1:  Com = CommunityDetection(G)
2:  sample f from U^(|V| × d)
3:  while (i < µ)
4:      S = shuffle(V)
5:      for each v_i ∈ S do
6:          W_vi = CommunityawareRW(G, v_i, Com, Ɩ)
7:          SkipGram(f, W_vi, 𝑤)
8:      end for
9:  end while
In Algorithm 1, line 1 detects the communities of the given graph G; the Louvain method is used for this, as explained in section 3.1. Before learning the optimal representation vectors for the graph nodes, we initialize the representation matrix f by sampling it uniformly at random in line 2. The final representation vectors are then learned in lines 3-9: for each node in V, µ different customized random walks are generated to better capture the global and local structure of the node. Before iterating over the network nodes, they are shuffled (line 4) to avoid any effect of the node visiting order on the final representations. The core of the presented method is line 6, where we generate a customized random walk for the chosen node, as clarified in section 3.2. Finally, the generated path is used to update the node representation in line 7. In the following sections, the different functions of Algorithm 1 are explained in more detail.
3.1. Community Detection:
We use the Louvain method to detect communities by maximizing the modularity of the network [27]. Modularity is a metric that compares the density of edges inside communities to the density of edges between communities. The Louvain method is an optimization algorithm that initially places each node in a separate community. Subsequently, a node is chosen, and the modularity gain of moving it to each of its neighbors’ communities is calculated. The node is finally assigned to the community for which the modularity is maximized. The modularity of the network is calculated using the following formula:
$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \qquad (1)$$
In Eq. 1, m stands for the sum of all edge weights in the graph, and A_ij denotes the edge weight between nodes i and j. The sums of the weights of the edges attached to i and to j are represented by k_i and k_j, and the communities of i and j are denoted c_i and c_j. Finally, δ is the delta function, which returns 1 when the two communities are equal and 0 otherwise.
As the modularity maximization problem is intractable, the Louvain method proceeds heuristically: it first finds small communities, then treats each small community as a node in a new network and tries to maximize the modularity of this new network [27].
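As an illustration, the sketch below runs this step with the Louvain implementation built into NetworkX (available in NetworkX ≥ 3.0) in place of the authors’ own pipeline; the toy graph, seed and helper name are illustrative assumptions, not the paper’s code.

import networkx as nx

def detect_communities(G):
    # Each community is returned as a set of nodes; we also build a
    # node -> community-members lookup, since Algorithm 2 needs Com_vi.
    communities = nx.community.louvain_communities(G, seed=42)
    com_of = {}
    for com in communities:
        for node in com:
            com_of[node] = com
    return communities, com_of

G = nx.karate_club_graph()  # toy stand-in for a social network
communities, com_of = detect_communities(G)
print(len(communities), "communities, modularity =",
      nx.community.modularity(G, communities))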
3.2. Generation of neighborhood structure:
To extract the neighborhood structure of a node, we build µ customized random walks. A customized random walk that starts from node v is denoted 𝒲v. Since a random walk is a path in the given network, we can describe a customized random walk for node v with random variables 𝒲v^1, 𝒲v^2, …, 𝒲v^k, such that 𝒲v^(k+1) is a node selected at random either from the immediate neighbors of the k-th node of the path or from the nodes that are in the same community as it.
To create a customized random walk starting from node v, we first extract all immediate neighbors of the current node. Then a random variable r between 0 and 1 is generated. If r is less than α, we pick a node at random from the immediate neighbors; otherwise, we choose a node at random from the nodes that are in the same community as the k-th node of 𝒲v, as shown in Eq. 2.
$$\mathcal{W}_v^{k+1} = \begin{cases} \text{a node sampled from the immediate neighbors of the } k\text{-th node,} & 0 < r \le \alpha \\ \text{a node sampled from the communities of the } k\text{-th node,} & \alpha < r < 1 \end{cases} \qquad (2)$$
If the k-th node of 𝒲v is a member of several communities, we first pool all the members of these communities and then choose one of them at random.
This process continues until the path reaches a predefined length Ɩ. Furthermore, if a node in the path has no new neighbor, we stop expanding the path. Algorithm 2 details the construction of customized random walks in CARE.
Algorithm 2: CommunityawareRW (G, v_i, Com_vi, Ɩ, α)
Input:
    Graph G(V, E)
    Source node of the random walk v_i
    Nodes belonging to the same community as v_i: Com_vi
    Random walk max length Ɩ
    Threshold for selecting from neighbors or community members α
Output:
    A path with max length Ɩ
1:  initialize RW with v_i
2:  while length(path) < Ɩ
3:      if current node has neighbors
4:          if (random(0, 1) < α)
5:              select v_j at random from v_i's neighbors
6:          else
7:              select v_j at random from members of v_i's communities
8:      else
9:          backtrack in the path and select the last node which has
            neighbors that are not in the path
10: end while
The proposed random walk generator extracts both local and global information from the given network. In addition, we can parallelize the algorithm to speed up the network embedding process, because the customized random walks are generated independently from each other. Additionally, if some new nodes are added to (or removed from) the network, their biased random walks can be generated without recomputing the customized random walks of the previous nodes.
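A minimal Python sketch of Algorithm 2 is given below, following Eq. 2 as written (a neighbor step when r < α, a community jump otherwise). It reuses G and com_of from the previous sketch, omits the backtracking step for brevity, and all names and parameter values are illustrative assumptions.

import random

def community_aware_rw(G, start, com_of, max_len, alpha):
    # With probability alpha the walk steps to a random immediate
    # neighbor; otherwise it jumps to a random member of the current
    # node's community, which may be many hops away.
    walk = [start]
    while len(walk) < max_len:
        cur = walk[-1]
        neighbors = list(G.neighbors(cur))
        if not neighbors:
            break  # simplification: no backtracking in this sketch
        members = [u for u in com_of[cur] if u != cur]
        if random.random() < alpha or not members:
            walk.append(random.choice(neighbors))
        else:
            walk.append(random.choice(members))
    return walk

# mu = 10 walks of max length 80 per node, matching the paper's settings
walks = [community_aware_rw(G, v, com_of, max_len=80, alpha=0.2)
         for v in G.nodes() for _ in range(10)]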
3.3. Skip-Gram:
According to Algorithm 1, after the random walks are generated, we use the Skip-gram model to learn the representations of the graph nodes [32, 33]. Skip-gram is a language model that maximizes the conditional probability of word co-occurrence within a predefined window 𝑤, as shown in Eq. 3:
$$\max_f \Pr(w \mid f(u)) = \prod_{\substack{j = i - w \\ j \neq i}}^{i + w} \Pr(v_j \mid f(u)), \qquad w = \{ v_{i-w}, \dots, v_{i+w} \} \setminus u \qquad (3)$$
For each node in the given network, we iterate over all of its customized paths and slide a window 𝑤 over each path. Similarly to previous approaches, the conditional probabilities in Eq. 3 are assumed to be independent. Additionally, a sigmoid function is used to approximate the probability distribution of Eq. 3, as shown in Eq. 4:
$$\Pr(v_j \mid f(u)) = \frac{1}{1 + e^{-f(u) \cdot f(v_j)}} \qquad (4)$$
Furthermore, stochastic gradient descent (SGD) is used to optimize the parameters, similarly to the method proposed in [34]. At the beginning of training, the learning rate is 0.025, and it decreases linearly with the number of vertices seen so far, as stated in [22].
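As a concrete illustration, the walks can be fed to an off-the-shelf Skip-gram implementation, as DeepWalk and Node2vec do. The sketch below uses gensim’s Word2Vec (version ≥ 4) under the paper’s parameter settings; it stands in for the authors’ own training code.

from gensim.models import Word2Vec

# Treat each walk as a "sentence" whose "words" are node identifiers.
sentences = [[str(v) for v in walk] for walk in walks]
model = Word2Vec(sentences,
                 vector_size=128,  # representation size d
                 window=10,        # window length w
                 sg=1,             # Skip-gram rather than CBOW
                 min_count=0,      # keep every node, however rare
                 alpha=0.025,      # initial SGD learning rate, as in [22]
                 workers=4)
embedding = {v: model.wv[str(v)] for v in G.nodes()}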
For complex networks, we consider the different edges between two nodes independently: the more edges there are between two nodes, the more probable it is that these nodes are chosen along a path. If the given network is weighted, we treat the edge weights as probabilities of picking the edges when producing the customized random walks.
For large-scale networks, CARE computes the communities with the Louvain method and obtains the customized paths with multiple threads simultaneously, since the path generation for each node is done independently.
Our algorithm also remains scalable when new nodes are appended to (or removed from) the network: customized random walks are generated only for the new nodes, and their representation vectors are calculated as stated above. We can generate the customized random walks in parallel to increase the speed of CARE. The time complexity of CARE is the same as that of the Skip-gram model: once the community detection and path generation phases are finished, the representation learning of nodes starts, and the largest cost among these phases is the node representation learning of the Skip-gram model.
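Because the walks are independent, the generation step parallelizes trivially. A hedged sketch with Python’s standard library follows; it is process-based, so in a script the worker function and the shared G and com_of would need to be importable by the worker processes.

from concurrent.futures import ProcessPoolExecutor

def walks_for_node(v):
    # All mu walks for one node; nodes can be processed in parallel
    # because no walk depends on any other walk.
    return [community_aware_rw(G, v, com_of, max_len=80, alpha=0.2)
            for _ in range(10)]

with ProcessPoolExecutor() as pool:
    all_walks = [walk for per_node in pool.map(walks_for_node, G.nodes())
                 for walk in per_node]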
4. Experiments:
In this section, we evaluate CARE on two supervised learning tasks: multi-label classification and link prediction. We also analyze the effect of the different parameters. We compare our results on the aforementioned tasks with the best representation learning methods, which are described in section 4.1.
4.1. Baseline Algorithms:
To evaluate the performance of the proposed algorithm, we compare it with the following representation learning algorithms, which have the best results on the multi-label classification and link prediction tasks:
- Spectral clustering [28]: This algorithm attempts to find graph cuts that lead to a better classification of the graph. It first calculates the normalized Laplacian matrix of the graph G and then takes the d smallest eigenvectors of this matrix as the feature vectors representing the graph nodes.
- DeepWalk [22]: DeepWalk is the first algorithm that uses deep learning for social network embedding. It generates random walks to learn the representation vectors of the nodes. DeepWalk can be considered a variant of CARE with α = 0 in Algorithm 2.
- LINE [23]: LINE uses local information, namely the first and second order proximities of nodes, instead of generating random walks. It first defines two separate functions to preserve the immediate relations and the second order proximities in a social network; in a second stage, the two functions are combined linearly to calculate the final representation of each node.
- Node2vec [24]: Node2vec is a semi-supervised algorithm that generates second order random walks to capture the network neighborhood information of nodes. It uses two parameters to simulate BFS and DFS search strategies.
- M-NMF [25]: This algorithm learns the final representation of nodes using two independent functions that generate two matrices. The first matrix keeps information about the local structure of the network, including the first and second order proximities, while the second matrix contains the representation of the network communities. Finally, non-negative matrix factorization is used to preserve all the structural information of the network.
Parameter settings: To compare our results with the above algorithms, we use the same parameter settings as reported in [24] for all the algorithms: 𝑤 = 10, d = 128, µ = 10 and Ɩ = 80. The optimal value for α is 0.2. We employ the same datasets and experimental procedure as [24]. The best values for p and q in the Node2vec algorithm are chosen from {0.25, 0.5, 1, 2, 4}, as stated in [24].
4.2. Multi label classification:
In the multi-label classification task, we predict one or more labels for each network node. To compare our algorithm with the baseline algorithms, we evaluate the methods on the following datasets:
BlogCatalog [35]: A social network of bloggers in which the node labels are the topic categories declared by each blogger. It has 10312 nodes, 333983 edges and 39 distinct topic labels.
Protein-Protein Interactions (PPI) [36]: A subgraph of the Homo Sapiens PPI network, preprocessed in [24], with 3890 nodes, 76584 edges and 50 labels extracted from gene sets.
Wikipedia [37]: A word co-occurrence network of Wikipedia articles with 4777 nodes, 184812 edges and 40 distinct labels. The node labels are part-of-speech (POS) tags.
Table 1 summarizes the statistics of the datasets used in the multi-label classification task.
Dataset                               |V|      |E|      Labels
BlogCatalog                           10312    333983   39
Protein-Protein Interactions (PPI)    3890     76584    50
Wikipedia                             4777     184812   40
Table 1: Datasets used in the multi-label classification task
To learn a classifier, we use a fraction of the learned representations, along with their labels, as the training set. The representations of the remaining nodes are used to evaluate the performance of all algorithms in multi-label classification. A regression classifier is used to predict the labels of the test-set nodes.
4.2.1. Experimental Results:
In the experiments, the training fraction of the input datasets is increased from 10% to 90%. The Micro-f1 and Macro-f1 measures are used to evaluate the performance of the different algorithms [31]. Micro-f1 gives equal weight to each data instance, while Macro-f1 gives equal weight to each class. They are defined as follows:
$$\text{Precision} = \frac{\sum_{A \in C} TP(A)}{\sum_{A \in C} \left( TP(A) + FP(A) \right)} \qquad (5)$$

$$\text{Recall} = \frac{\sum_{A \in C} TP(A)}{\sum_{A \in C} \left( TP(A) + FN(A) \right)} \qquad (6)$$

$$\text{Micro-}f1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (7)$$

$$\text{Macro-}f1 = \frac{\sum_{A \in C} \text{Micro-}f1(A)}{|C|} \qquad (8)$$
In the above formulas, C is the overall label set, and TP(A), FP(A) and FN(A) are the numbers of true positives, false positives and false negatives, respectively, among the instances predicted as label A. Micro-f1(A) is the Micro-f1 measure restricted to the label A.
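In practice these scores are computed with off-the-shelf implementations. The sketch below pairs a one-vs-rest logistic regression with scikit-learn’s micro and macro F1; the vectors and label matrix are random stand-ins purely for illustration, not the paper’s data.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))         # stand-in node representation vectors
Y = rng.integers(0, 2, size=(1000, 39))  # stand-in multi-hot label matrix
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, train_size=0.5, random_state=0)

# One-vs-rest logistic regression over the learned vectors.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)
Y_hat = clf.predict(X_te)

print("Micro-f1:", f1_score(Y_te, Y_hat, average="micro"))
print("Macro-f1:", f1_score(Y_te, Y_hat, average="macro"))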
Figure 1 shows the performance of CARE in comparison to the other methods on the multi-label classification task over the different networks. In the following, we discuss the experimental results for each dataset.
- BlogCatalog: On the BlogCatalog dataset, our method shows significant improvements in both Micro and Macro-f1. With only 10% training data, CARE achieves a gain of 50% over Node2vec. Under this condition, both LINE and Spectral clustering perform poorly because they only use local neighborhood information. M-NMF also performs worse than the deep-learning-based methods because it combines the first and second order proximities in a unified matrix, so network sparsity cannot be handled well by these algorithms. The most important difference between CARE and the other algorithms is the use of community information during path generation; as a result, we maintain both the local and the global structural information of the network when learning the representation vector of each node, which is especially useful in sparse networks.
- PPI: As the results in Figure 1 show, both evaluation metrics are lower than on the BlogCatalog network because the PPI network is less dense than BlogCatalog. CARE shows significant improvements over Node2vec, DeepWalk and the community preserving method. With 50% training data, our method outperforms Node2vec by about 50% because we utilize community membership information while learning the node representations. M-NMF also gives weak results on this dataset because it suffers from the local structure information loss discussed in section 2.
- Wikipedia: As another evaluation, we test CARE on the word co-occurrence network of Wikipedia articles. The experimental results show that CARE outperforms the baseline algorithms in both Micro and Macro-f1. When the training size reaches 80%, our method achieves its highest improvement of 7% over Node2vec. The Wikipedia dataset is denser than the other datasets, yet M-NMF still fails to predict the node labels well because it does not preserve local information directly while learning the node representations.
[Figure 1: Micro-f1 and Macro-f1 scores on the multi-label classification task for the BlogCatalog, PPI and Wikipedia datasets, plotted against training sizes from 10% to 90%.]
As shown in Figure 1, when the network is sparse there is less local information per node, and as a result most network embedding methods cannot perform the multi-label classification task well. The average Micro-f1 and Macro-f1 measures for the different datasets are also reported in Table 2.
                      BlogCatalog          PPI                  Wikipedia
Algorithm             Micro-f1  Macro-f1   Micro-f1  Macro-f1   Micro-f1  Macro-f1
CARE                  60.69     44.09      37.55     34.21      59.33     20.30
Node2vec              39.98     25.61      19.86     17.37      55.88     16.42
DeepWalk              37.94     20.70      19.77     17.23      51.14     13.80
M-NMF                 35.2      17.66      18.14     12.86      46.17     10.31
LINE                  22.27     8.15       19.19     13.68      49.87     12.63
Spectral Clustering   19.57     4.45       14.31     7.61       40.12     4.28
Table 2: Average Micro-f1 and Macro-f1 for the different datasets
4.2.2. Parameter sensitivity:
In this experiment, the best parameter values for CARE are determined on BlogCatalog in the multi-label classification task. In each experiment, we keep the default values for all parameters and change only one of them, using 50% of the input network as the training set. One of the most important parameters of our method is α; Figure 2(a) shows its effect while the other parameters are set to their default values. The effect of different representation vector sizes is illustrated in Figure 2(b): once d reaches 128, the curve saturates. Of course, as stated in [22], µ and 𝑤 can also influence the appropriate representation size.
The effect of µ is shown in Figure 2(c). Increasing the number of walks per node gathers more information about that node and leads to better coverage of the node’s neighborhood. However, after the number of walks reaches 40, the curve saturates for BlogCatalog, and increasing µ beyond 40 makes no difference in Micro-f1.
The Skip-gram model uses a window to capture the relationships between words that are close to each other in a document, and we use the same window to relate the nodes located on a path. As Figure 2(d) indicates, increasing 𝑤 embeds less local information about the nodes within the window into the representation vector, so the performance of CARE decreases.
[Figure 2: Effect of different parameters on CARE performance (Micro-f1 on BlogCatalog): (a) parameter α, (b) representation size d, (c) number of walks per node µ, (d) window size 𝑤.]
4.3. Link Prediction:
Link prediction is a supervised learning problem that attempts to predict future edges of the given network. To evaluate the performance of CARE on link prediction, we removed 50% of the network edges at random. In contrast to Node2vec, we do not require the remaining network to stay connected after each edge removal. Node2vec depends on the connectivity of the network and therefore fails to detect the edges of leaf nodes. CARE, however, can detect the neighborhood structures of leaf nodes by using their community information, which is embedded in their representation vectors even though their edges were removed.
Since representation learning algorithms only generate a feature vector for each node separately, we follow [24] and extend our algorithm with different operators that produce a representation for an edge (u, v) with the same size as the representation vectors of the source and destination nodes of the edge. The operators considered by CARE are defined by the following formulas [24]:
$$\text{Hadamard:} \quad [f(u) \boxdot f(v)]_i = f(u)_i \cdot f(v)_i \qquad (9)$$

$$\text{Average:} \quad [f(u) \boxplus f(v)]_i = \frac{f(u)_i + f(v)_i}{2} \qquad (10)$$

$$\text{Weighted-L1:} \quad \| f(u) \cdot f(v) \|_{\bar{1}i} = | f(u)_i - f(v)_i | \qquad (11)$$

$$\text{Weighted-L2:} \quad \| f(u) \cdot f(v) \|_{\bar{2}i} = | f(u)_i - f(v)_i |^2 \qquad (12)$$
In the above formulas, f(u)_i and f(v)_i are the i-th features of u and v, respectively, as learned by the representation methods.
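For concreteness, the four operators of Eqs. 9-12 translate directly into a few lines of NumPy; the vectors below are random placeholders for two learned node representations.

import numpy as np

# Eqs. 9-12: combine two node vectors into one edge vector of the
# same dimensionality.
def hadamard(fu, fv):
    return fu * fv

def average(fu, fv):
    return (fu + fv) / 2.0

def weighted_l1(fu, fv):
    return np.abs(fu - fv)

def weighted_l2(fu, fv):
    return (fu - fv) ** 2

fu, fv = np.random.rand(128), np.random.rand(128)  # placeholder vectors
edge_vector = hadamard(fu, fv)  # the best-performing operator (Figure 3)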
We compare the performance of our method with the other algorithms on the datasets whose statistics are presented in Table 3.
Dataset                |V|      |E|
PPI [36]               19706    390633
arXiv ASTRO-PH [38]    18772    198110
Table 3: Datasets used for link prediction
To learn a classifier for link prediction, we choose a training set at random from the existing links of the network as the positive edge set. We also build a negative edge set of the same size, containing node pairs that are not connected in the network. We then evaluate the different algorithms on the remaining links of the network. A regression classifier is used to distinguish existing from non-existing links.
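A hedged end-to-end sketch of this protocol is shown below. For brevity it reuses the embedding trained earlier, whereas in the paper the representations are learned on the residual graph after the test edges have been removed; the splits and names are illustrative assumptions.

import random
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Split existing edges into train/test positives and sample an equal
# number of non-adjacent pairs as negatives.
edges = list(G.edges())
random.shuffle(edges)
pos_train, pos_test = edges[:len(edges) // 2], edges[len(edges) // 2:]

def sample_negatives(n):
    nodes, negatives = list(G.nodes()), []
    while len(negatives) < n:
        u, v = random.sample(nodes, 2)
        if not G.has_edge(u, v):
            negatives.append((u, v))
    return negatives

neg_train = sample_negatives(len(pos_train))
neg_test = sample_negatives(len(pos_test))

def edge_features(pairs):
    # Hadamard operator (Eq. 9) over the learned node vectors.
    return np.array([embedding[u] * embedding[v] for u, v in pairs])

X_train = np.vstack([edge_features(pos_train), edge_features(neg_train)])
y_train = np.array([1] * len(pos_train) + [0] * len(neg_train))
X_test = np.vstack([edge_features(pos_test), edge_features(neg_test)])
y_test = np.array([1] * len(pos_test) + [0] * len(neg_test))

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("AUC-ROC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))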
4.3.1. Experimental results:
The AUC-ROC scores of the algorithms are reported in Table 4. For the link prediction task, the best value for α is 0.15. We compared CARE with previous heuristic methods for link prediction [24]; these heuristics score each edge by the number of immediate neighbors shared by its endpoints, counted in different ways. Compared with the heuristic methods, CARE shows about 14% improvement on the arXiv dataset. We also compare the performance of our algorithm with the representation learning algorithms introduced in section 4.1; our method shows a 3% gain over the Node2vec algorithm on the PPI dataset.
Algorithm               arXiv     PPI
Pref. attachments       0.6996    0.6670
Jaccard's Coefficient   0.8067    0.7018
Common neighbors        0.8153    0.7142
Adamic-Adar             0.8315    0.7126
Spectral Clustering     0.5470    0.4920
M-NMF                   0.9028    0.7318
LINE                    0.8902    0.7249
DeepWalk                0.9340    0.7441
Node2vec                0.9366    0.7719
CARE                    0.9473    0.7966
Table 4: AUC scores of the different methods on link prediction (all values except for CARE come from [24])
Our method improves on both datasets over Node2vec and DeepWalk, which have the best results among the representation learning algorithms. DeepWalk and Node2vec generate walks randomly, with no information about the communities of nodes. Although Node2vec attempts to capture the homophily property using two control parameters, these parameters cannot guarantee that community information is preserved in a biased random walk. In contrast to Node2vec, CARE embeds this information into the customized random walks by jumping, with probability controlled by α, to nodes that are in the same community as the last node along the path.
We also investigate the effectiveness of the different operators for obtaining edge representations. The results for the different operators are shown in Figure 3.
[Figure 3: AUC-ROC scores of CARE on link prediction for the Hadamard, Weighted-L1, Weighted-L2 and Average operators.]
As illustrated in Figure 3, the Hadamard operator is the best choice for CARE on the link prediction task.
5. Conclusion:
In this paper, we have presented a novel network embedding algorithm called CARE. To learn the representation vectors of nodes, we generate customized random walks as contextual information. In contrast to previous network embedding methods, we consider both the global and the local neighborhood of nodes while creating the paths. The Skip-gram model is used in CARE to learn the final representations of the nodes. Our algorithm can embed different types of networks and is robust to node addition and removal. It is scalable because it can generate and process the customized random walks of different nodes in parallel. We have evaluated CARE on multi-label classification and link prediction tasks. Experimental results on different networks show significant improvements over the state-of-the-art network embedding methods.
As part of future work, we plan to create the customized random walks while computing the communities of the input network in order to speed up CARE. We would also like to extend the proposed method to heterogeneous networks with different types of nodes and relations. In real-world networks, nodes might belong to multiple communities; as another research direction, we would like to investigate the effect of various community detection algorithms, such as overlapping community detection, on real-world social networks using CARE. We also plan to investigate the effect of community aware random walks on the Node2vec and LINE algorithms.
6. References:
[1] L. Getoor, B. Taskar, Introduction to statistical relational learning, MIT press, 2007.
[2] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. Eliassi-Rad, Collective classification in network data,
AI magazine, 29(3): 93-106, 2008.
[3] G. Tsoumakas, I. Katakis, Multi-label classification: An overview, Dept. of Informatics, Aristotle University of
Thessaloniki, Greece, 2006.
[4] L. Backstrom, J. Leskovec, Supervised random walks: predicting and recommending links in social networks, In WSDM,
2011.
[5] F. Fouss, A. Pirotte, J. M. Renders, M. Saerens, Random-walk computation of similarities between nodes of a graph with
application to collaborative recommendation, IEEE Trans. on Knowledge and Data Engineering, 19(3):355-369, 2007.
[6] D. Liben-Nowell, J. Kleinberg, The link-prediction problem for social networks, J. of the American society for
information science and technology, 58(7):1019–1031, 2007.
[7] P. Radivojac, W. T. Clark, T. R. Oron, A. M. Schnoes, T. Wittkop, A. Sokolov, K. Graim, C. Funk, Verspoor, et al,
A large-scale evaluation of computational protein function prediction, Nature methods, 10(3):221–227, 2013.
[8] A. Vazquez, A. Flammini, A. Maritan, A. Vespignani, Global protein function prediction from protein-protein
interaction networks, Nature biotechnology, 21(6):697–700, 2003.
[9] S.-H. Yang, B. Long, A. Smola, N. Sadagopan, Z. Zheng, H. Zha, Like alike: joint friendship and interest
propagation in social networks, In WWW, 2011.
[10] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, In NIPS, 2001.
[11] Y. Bengio, A. Courville, P. Vincent, Representation learning: A review and new perspectives, IEEE TPAMI,
35(8):1798–1828, 2013.
[12] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, In ICLR, 2013.
[13] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, In EMNLP, 2014.
[14] S. T. Roweis, L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science, 290(5500):2323–2326,
2000.
[15] J. B. Tenenbaum, V. De Silva, J. C. Langford, A global geometric framework for nonlinear dimensionality reduction,
Science, 290(5500):2319–2323, 2000.
[16] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, S. Lin, Graph embedding and extensions: a general framework for
dimensionality reduction, IEEE TPAMI, 29(1):40–51, 2007.
[17] S. Bhagat, G. Cormode, S. Muthukrishnan, Node Classification in Social Networks, In: Aggarwal C. (eds) Social Network
Data Analytics. Springer, Boston, MA, 115-148, 2011.
[18] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, J. Han, Personalized entity recommendation: A
heterogeneous information network approach, In WSDM, 283-292, 2014.
[19] T. F. Cox, M. A. Cox, Multidimensional scaling, CRC Press, 2000.
[20] S. T. Roweis, L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science, 290(5500):2323–2326,
2000.
[21] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, A. J. Smola, Distributed large-scale natural graph
factorization, In WWW, 37-48, 2013.
[22] B. Perozzi, R. Al-Rfou, S. Skiena, DeepWalk: Online learning of social representations, In KDD, 2014.
[23] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, Q. Mei, LINE: Large-scale Information Network Embedding, In
WWW, 2015.
[24] A. Grover, J. Leskovec, Node2vec: Scalable feature learning for networks, In SIGKDD, 1225–1234, 2016.
[25] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, S. Yang, Community preserving network embedding, In AAAI, 203–209, 2017.
[26] X. Wang, D. Jin, X. Cao, L. Yang, W. Zhang, Semantic community identification in large attribute networks, In AAAI,
2016.
[27] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre, Fast unfolding of communities in large
networks, J. of Statistical Mechanics: Theory and Experiment, 10, 2008, doi: 10.1088/1742-5468/2008/10/P10008.
[28] L. Tang, H. Liu, Leveraging social media networks for classification, IEEE Trans. on Data Mining and Knowledge
Discovery, 23(3):447–478, 2011.
[29] S. Fortunato, Community detection in graphs, Physics Reports, 486(3-5):75 – 174, 2010.
[30] K. Henderson, B. Gallagher, T. Eliassi-Rad, H. Tong, S. Basu, L. Akoglu, D. Koutra, C. Faloutsos, L. Li, RolX: Structural
role extraction & mining in large graphs, In KDD, 2012.
[31] D. Wang, P. Cui, W. Zhu, Structural deep network embedding, In SIGKDD, 1225–1234, 2016.
[32] R. I. Kondor, J. Lafferty, Diffusion kernels on graphs and other discrete input spaces, In ICML, 2: 315-322, 2002.
[33] F. Lin, W. Cohen, Semi-supervised classification of network data using very few labels, In ASONAM, 192-199, 2010.
[34] L. Bottou, Stochastic gradient learning in neural networks, In Proc. of Neuro- Nimes, France, 1991.
[35] R. Zafarani, H. Liu, Social computing data repository at ASU, 2009.
[36] B. J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, R. Oughtred, D. H. Lackner,
J. Bähler, V. Wood, et al, The BioGRID interaction database, Nucleic acids research, 36: 637–640, 2008.
[37] M. Mahoney, Large text compression benchmark, www.mattmahoney.net/dc/textdata, 2011.
[38] J. Leskovec, A. Krevl, SNAP Datasets: Stanford large network dataset collection, 2014, http://snap.stanford.edu/data.