[go: up one dir, main page]

0% found this document useful (0 votes)
89 views53 pages

Clustering Hierarchical2

This document provides an overview of a teaching module that introduces undergraduates to network science and community detection using hierarchical clustering algorithms. The module aims to help students understand how network analysis can advance science and society. Students will learn key network concepts, topological features of graphs, measures for community structure quality, and how to perform agglomerative and divisive hierarchical clustering to detect communities in networks. Hands-on exercises using network analysis software are also provided.

Uploaded by

Anas Riaz Khan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
89 views53 pages

Clustering Hierarchical2

This document provides an overview of a teaching module that introduces undergraduates to network science and community detection using hierarchical clustering algorithms. The module aims to help students understand how network analysis can advance science and society. Students will learn key network concepts, topological features of graphs, measures for community structure quality, and how to perform agglomerative and divisive hierarchical clustering to detect communities in networks. Hands-on exercises using network analysis software are also provided.

Uploaded by

Anas Riaz Khan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 53

Community Detection with Hierarchical

Clustering Algorithms

An Undergraduate Teaching Module


Prepared By:

Donna Beers, Simmons College, donna.beers@simmons.edu


Robert Campbell, College of St. Benedict and St. John’s University, rcampbelliii@csbsju.edu
2

Note to teachers: Teacher notes appear in dark red in the module, allowing faculty to pull
these notes off the teacher version to create a student version of the module.

Summary of the Module:


The goal of this module is to introduce undergraduates from the sciences and social sciences to
the interdisciplinary field of network science. Students will learn how mathematical and
computational tools can be used to analyze complex systems to advance science and prosperity
as well as to promote public health and safety. Students will understand the role of community
detection within network analysis and learn hierarchical clustering methods for carrying it out.

Target Audience:
Second or third-year undergraduates

Prerequisites:
Students beginning this module should know elementary graph theory and matrix algebra.

Topics:
Introductory network science concepts, techniques, technology, applications.

Learning Outcomes:
Students completing this module will be able to:
• State currently used definitions for network, community, and modularity of a network
• State and apply the definitions for the following types of graphs: simple, directed,
undirected, weighted
• State and calculate the following topological features of a graph: degree sequence,
shortest paths, edge betweenness, connected, disconnected, density, cliques
• State and apply the definitions for the intra-cluster density and the inter-cluster density of
for any subgraph of a network
• For a given partition of a network into a family of clusters, state and apply the definitions
used in Albert-László Barabási's book, Network Science, for (1) the modularity of a
cluster in the partition, and (2) the modularity of the whole network
• Describe the agglomerative and divisive community detection algorithms and be able to
apply these algorithms to detect communities in small, engineered networks. Also, be
able to draw the nested diagram and dendrogram visualizations that result from carrying
out community detection with hierarchical clustering algorithms
• Utilize text and social network analyzer and visualization software (Netlytic) to model,
visualize, and describe social networks
• Interpret the findings from the social network analysis and communicate the results from
community detection
3

Anticipated Number of Class Periods:


Three to four classes (50-minute).

Other DIMACS modules related to this module:


● Who Is Really In Charge? An Application of Graph and Network Theory to Analyzing
Social Networks
● Social Media Analysis
● Network Theory and Defense Allocations
4

Table of Contents

1. Introduction 5

2. Examples of Networks and Communities 6


2.1 What is a network? 6
2.2 What is a network community? 7
2.3 What is community detection? 8
2.4 How is community detection used? 8

3. Graph Theory Preliminaries 11

4. The Topology of a Graph 15


4.1 Focus on vertices 16
4.2 Focus on edges 19
4.3 Global attributes of a graph: connectedness and density 23

5. Measuring the Quality of a Community Structure 26


5.1 Interconnectivity and intraconnectivity density measures for subgraphs 27
5.2 For a given partition of a network into a family of clusters, how do we 29
measure the modularity of the partition?

6. Community Detection with Hierarchical Clustering Algorithms 34


6.1 Community Detection with Agglomerative Hierarchical Clustering Algorithms 34
6.2 Community Detection with Divisive Hierarchical Clustering Algorithms 41

7. Technology Tutorial 47
7. 1 Introduction to Netlytic 47
7.2 Creating a Netlytic Account 47
7.3 Basic Activity to see what Netlytic can create 47
7.4 Discussion Questions 48
7.5 Additional Resources 49
7.6 Homework Exercises 49

8. Suggested Projects 50

9. Endnotes 51

10. References 52
5

1. Introduction

Do you have an idea for a Kickstarter project, say a new electronic instrument or a window
farm? Or are you thinking about running for political office? How do you get backers for your
great idea?

Since its launch on April 28, 2009, Kickstarter has backed over 100,000 successfully funded
projects."i Recent examples include Mary Mattingly's floating garden, Swale (see Figure 1
below) a garden on a 130-by-40-foot steel barge; De La Soul’s album, And the Anonymous
Nobody; and Massoud Hassani’s Mine Kafon drone which finds and detonates landmines.

How did these projects get funded? "The majority of initial funding usually comes from the fans
and friends of each project. If they like it, they'll spread the word to their friends, and so on.
Press, blogs, Twitter, Facebook, and Kickstarter itself are also big sources of traffic and
pledges."ii

Figure 1.1 Swale: "A Floating Food Forest in New York City"iii

Whatever your creative idea might be, completing this module will give you important social
network analysis tools for tapping into communities which can support you and help bring your
great idea to fruition!
6

2. Examples of Networks and Communities

2.1 What is a network?


A network is any interconnected system of things. Social networks are networks connecting
people. At birth you instantly (and involuntarily!) joined a family network. By now you may
have chosen to belong to many other social networks, such as, Facebook or LinkedIn or Twitter
news-sharing networks.

The quality of our daily lives depends on the successful functioning of many ny kinds of complex
networks, including communications networks such as the World Wide Web, power networks
like the electrical grid, and transportation networks (e.g., airline, train, and highway networks).

Perhaps the most talked about network today is tthe


he Internet of Things (IoT) (see Figure 2.1). The
IoT "... is the concept of basically connecting any device with an on and off switch to the Internet
(and/or to each other). This includes everything from cellphones, coffee makers, washing
machines, headphones, lamps, wearable devices and almost anything else you can think of." of. iv

Figure 2.1 The Internet of Things (IoT)v

Some of the most promising uses of networks lie in biology where researchers are mapping
neural networks and gene regulatory networks in hopes of understanding and intervening with
mechanisms underlying disease. Other important applications lie in public health
h and marketing,
where network analysis helps to find superspreaders of infectious diseases and influencers who
spread news of new products or innovations.
7

In network science, the things that are connected in a network are called nodes. The connections
between the nodes are called links.
links

2.2 What is a network community?


Are you on Facebook? If so, how long have you belonged? Have you ever visualized your
Facebook network? What clusters in your Facebook would you expect to see?

In 2011, as he was finishing up his Ph.D. at Caltech, Eric Bellm joined LinkedIn and he wrote a
three-part
part blog where he described and displayed social networks to which he belonged -
LinkedIn, Facebook, and Twitter. Today Bellm is Project Scientist at the Palomar Observatory
Observator at
CalTech. Back in 2011 he had belonged to Facebook for seven years. Using the Gephi
visualization tool, here is how he described the communities in his Facebook network, shown in
Figure 2.2:

"While I have not plotted the names for privacy reasons, Gephi
Gephi quite accurately detects the
communities here. Red points are my college friends; red
red-orange
orange are from a college club.
Green points are high school friends, hometown folks, and family. Blue points are grad school
classmates, and peach are members ofof a grad school club. The size of the nodes in this plot scale
with the age of the account; as expected, my college friends (red) have the oldest accounts, while
the newest accounts are generally hometown/family (green)."vi

Figure 2.2 Eric Belmms's Facebook network, June 2011vii


8

2.3 What is community detection?


According to M. E.J. Newman, Anatol Rapoport Distinguished University Professor of Physics
at the University of Michigan, "Loosely stated, [community detection] is the problem of finding
the natural divisions of a network into groups of vertices [called communities or clusters] such
that there are many edges within groups and few edges between groups. What exactly we mean
by "many" or "few", however, is debatable, and a wide variety of different definitions have been
proposed, leading to a correspondingly wide variety of different algorithms for community
detection."viii

2.4 How is community detection used?


"The most common use for community detection," says Newman, "is as a tool for the analysis
and understanding of network data."ix For instance, the community structure in social networks
"can give us clues about the nature of the social interactions within the community represented."x
He adds: "Clusters of nodes in a web graph for instance might indicate groups of related web
pages. Clusters of nodes in a metabolic network might indicate functional units within the
network."xi

The use of social network analysis in political science, to study political participation and
conversation, has made visible the polarization in U.S. politics.

Example 2.1 Communities within the political blogosphere

In their 2005 paper, "The Political Blogosphere and the 2004 Presidential Election: Divided
They Blog," Lada Adamic and Natalie Glance wrote:

"The 2004 U.S. Presidential Election was the first presidential election in the United States in
which blogging played an important role. ...

We analyze the posts of 40 “A-list” blogs over the period of two months preceding the U.S.
Presidential Election of 2004, to study how often they referred to one another and to quantify
the overlap in the topics they discussed, both within the liberal and conservative communities,
and also across communities."xii

The network in Figure 2.3, from Adamic and Glance, shows the community structure of political
blogs (nodes = political blogs; a link from blog A to blog B means blog A referred to blog B. The
colors indicate political affiliation (red = Republican, blue= Democrat). Orange links go from
liberal to conservative, while purple links go from conservative to liberal. The size of each blog
reflects the number of blogs that link to it.xiii
9

Figure 2.3 "Community Structure of Political Blogs"xiv

Example 2.2 The Human Disease Network

In biology,
y, networks are being used to gain insight into the mechanisms that underlie disease. In
his new book, Network Science
Science, Albert-László Barabási, the Robert Gray Dodge Professor of
Network Science at Northeastern University, observes: "Communities play a particularly
important role in understanding human diseases. Indeed, proteins that are involved in the same
disease tend to interact with each other."xv

etwork"xvi shown in Figure 1.5, "each node is a gene, with two genes
In the “Disease Gene Network"
being connected if they are implicated in the same disorder. The size of each node is proportional
to the number of disorders in which the gene is implicated (see key). ... Only nodes with at least
one link are shown."xvii
10

Figure 2.4 Communities of genes associated with the same disorder

We end this section with one of the best known examples of a social network, the Zachary
Karate Club network. This network is used to benchmark virtually every community detection
algorithm.

Example 2.3 The Zachary Karate Club network is named for sociologist Wayne Zacharyxviii who
sought to understand the way in which a university karate club split into two groups after a
disagreement arose between the club's instructor and administrator. The club had 34 members
who, because the club was small, all knew each other. Over the three-year the period 1970-1972,
Zachary recorded the number of regular interactions that were occurring between club members
outside the club. From his data set of 34 nodes and 78 links, he constructed the network of
friendships shown in Figure 2.5. Applying the Ford–Fulkerson algorithm, Zachary correctly
assigned all but one member of the club to the groups they actually joined after the split. As
authors M. Bautista, D. Beers, M. Goodloe wrote in their CCICADA module, Who is Really in
Charge":

"The use of social network analysis and visualization techniques helps us to understand,
after the fact, why the members split the way they did. In the right-hand figure below [Figure
2.6], we can see that the members of the club formed two clusters: friends of person 1 (the
instructor) and the friends of person 34 (the club's administrator). The members of these
clusters had friendships almost exclusively with each other and led to the formation of
two new clubs after the break-up."xix
11

Figure 2.5 Zachary's Karate Club before splitxx Figure 2.6 Karate Club after the splitxxi

Having looked
ooked at a variety of networks and informally iintroduced the notion of community, we
now develop the mathematical ideas and tools
to that will help us to approach community more
rigorously and understand the mathematics behind community detection.

3. Graph Theory Preliminaries

Networks are represented by mathematical models called graphs. Regarding terminology, the
word network is often used when the focus is on a real-world, complex system,
system while graph
usually refers to the mathematical model and its properties.

A graph is an ordered pair (V, E), where V is a set of vertices (nodes) and E is a multiset of
edges (links) between the vertices in V. Throughout this module we let N be the number of
vertices in V which are denoted 1, 2, 3, ..., N. Also, we let L be the number of edges in E.

Example 3.1 The graph G = (V, E) in Figure 3.1 has vertex set, V = {1, 2, 3} and multiset
E = {{1, 1}, {2, 1}, {1, 3}, {1, 3}, {2, 3}}. A loop is an edge that begins and ends at itself.
The graph has a loop from vertex 1 to itself.

Figure 3.1 A graph with a repeated edge and a loop


12

A graph which has no loops and no repeated edges is a simple graph. For the remainder of this
module we will be working exclusively with simple graphs.
*****

Every graph has an equivalent representation called its adjacency matrix that records the nodes
in the graph and the edge connections between them. Specifically, for a graph G with N vertices,
1, 2, ..., N, the adjacency matrix for G, denoted A = [A ij ], is the N x N matrix defined as
follows:
A ij = 1, if there is an edge from vertex i to vertex j,
otherwise A ij = 0, for all i, j, where 1 ≤ i, j ≤ N.

Two vertices that are joined by an edge are adjacent.

Example 3.2 Consider the graph G = (V, E) shown in Figure 3.2.

Figure 3.2 A simple graph

Its vertex set is V = {1, 2, 3, 4, 5} and E = {{1,2}, {1,5}, {2,3}, {2,4}, {2,5}, {3,4}, {4,5}}.
Here is its adjacency matrix:

Vertex 1 2 3 4 5
 1 0 1 0 0 1
 
 2 1 0 1 1 1
 
 3 0 1 0 1 0
 4 0 1 1 0 1
 
 5 1 1 0 1 0 

Activity 3.1 Give the adjacency matrix for the graph in Figure 3.3.
13

Figure 3.3 The complete bipartite graph K 2,3

Answer. The 5 x 5 table below is the adjacency matrix for the graph in Figure 3.3.

Vertex a b x y z
 a 0 0 1 1 1 

 b 0 0 1 1 1
 
 x 1 1 0 0 0
 y 1 1 0 0 0
 
 z 1 1 0 0 0 

A directed graph (digraph) is a graph whose edges have a direction associated with them; a
graph whose edges are two-way is an undirected graph. Among social networks, a Twitter
network is a directed graph (e.g., in a "follows" network, nodes are users of the network; an edge
or arc is drawn from person A to person B if A follows B) while an acquaintance graph (nodes
are users; a link between two users means that they are acquainted with each other) is undirected.
A directed arc from A to B is represented by the order pair (A, B).

The graph in Figure 3.4 is a directed graph.

Activity 3.2 Give the vertex set, the edge set, and adjacency matrix for the following digraph.

Figure 3.4 A directed graph

Answer. The directed graph in Figure 3.4 has vertex set V = {1, 2, 3, 4) and edge set
E = {(1, 2), (1, 3), (3, 2), (3, 4), (4, 3)}. Its adjacency matrix is the following 4 x 4 table.
14

Vertices 1 2 3 4
 1 0 1 1 0 

 2 0 0 0 0
 
 3 0 1 0 1
 4 0 0 1 0 

A graph is weighted if its edges are labeled with numerical values called weights. Weights can
have different meanings, depending on the context.

Example 3.3 In the graph in Figure 3. 3.5, vertices represent U.S. airports, edges represent direct
flightss between the airports, and weights on the edges are the number of air miles between the
airports. Another way to label the edges would be to label each edge with the roundtrip airfare
cost between the connected airports.

3. A weighted graphxxii
Figure 3.5

Example 3.4 The novel Les Misérables,


Mis by Victor Hugo, provides a classic example of a social
network. In this network,, shown in Figure 3.6, the nodes are the characters in the novel, and an
edge is drawn between two nodes if their corresponding characters co-appear
appear in the book.
book This is
a weighted graph where the thickness of an edge between two characters is proportional to the
number of interactions
actions between the actors in the novel.
novel
15

Figure 3.6 Co-appearance network of the characters in Les Misérablesxxiii

4. The Topology of a Graph

A graph is defined by the way in which its vertices and edges are arranged, i.e., by the way its
vertices and edges are 'wired together.'xxiv

Example 4.1. In Figure 4.1, the graphs displayed have the same number of vertices and edges
but they have different structures. For example, in part (a) every vertex is adjacent to exactly two
vertices, but in part (b) the graph has a vertex that is adjacent to four vertices.
16

(a) (b)

Figure 4.1 Two graphs with 5 nodes and 5 links but different structures

We look at the structure of a graph from three perspectives: vertices, edges, and global.

4.1 Focus on vertices

Let G be a graph with N vertices and L edges and denote its vertices by 1, 2, ..., N.

The degree of a vertex is the number of edges that are incident with it. In an undirected, simple
graph, the degree of a vertex ranges from 0, when the vertex is adjacent to no other vertices in
the graph, to N-1, when the vertex is connected to all the other vertices in the graph.

For each vertex i, we denote its degree by k i , where 1 ≤ i ≤ N.

When the degrees k 1 , k 2 , ..., k N are written in nonincreasing order as d 1 , d 2 , ..., d N , we call
this sequence the degree sequence of the graph.

Example 4.2 The graph in Figure 3.2 has vertices 1, 2, 3, 4, 5. Their degrees are
k 1 = 2, k 2 = 4, k 3 = 2, k 4 = 3, k 5 = 3. The degree sequence of the graph is 4, 3, 3, 2, 2.

Activity 4.1 Give the degree sequences for the graphs in Figure 4.1 (above).

Answer. Graph (a) has degree sequence 2, 2, 2, 2, 2. Graph (b) has degree sequence 4, 2, 2, 1, 1.

Undirected graphs
There are two important relationships between the structure of an undirected graph, as
represented in its adjacency matrix A = [A ij ], and the degrees of the vertices. They are as
follows:
17

N
(1) ki = ∑ Aij , for any row i in the adjacency matrix, where 1 ≤
j =1
i ≤ N, and

N
(2) ki = ∑ Aji , for any column i in the adjacency matrix, where 1 ≤
j =1
i ≤ N.

Equation (1) follows from the fact that every row in the adjacency matrix consists only of 0s and
1s. When you add up the numbers in row i, this sum is the number of 1's in row ii, i.e., it is the
number of vertices in the graph that are adjacent to i. Equation (2) may similarly be explained.

An additional relationship, the degree sum formula, holdss true for every directed graph:
N
(3) ∑ ki = 2L
i =1

Equation (3) follows from the fact that ever


every edge between two vertices is two
two-way, so it is
counted in the degrees of both of these vertices.

Activity 4.2
(a) Find the degree sequence of the following bipartite graph (Figure 4.2)

Figure 4.2 A bipartite graph

(b) Draw two graphs which have different structures but which have degree sequence
2, 2, 2, 2, 2, 2. Explain how your graphs are different.

Answers.
(a) 3, 3, 2, 2, 2
(b) The graphs shown below each have degree sequence 2,2,2, 2, 2,2. The graph on the left has
one component while the graph on the right has two components.
18

Example 4.3 Both graphs in Figure 4.3 have degree sequence 3,2,2,2,1,1,1 but they are wired
differently. While each graph has exactly one vertex of degree 3, the vertex of degree 3 in (a) is
adjacent to a vertex of degree one, while the vertex of degree 3 in (b) is not adjacent to a vertex
of degree 1. (It is adjacent to two vertices of degree 2.)

(a) (b)

Figure 4.3 Two graphs with the same degree sequence but different structures

For a directed graph, we distinguish between the indegree and outdegree of a vertex. For vertex
i, where 1 ≤ i ≤ N, the indegree of i is the number of edges entering i, while the outdegree of i
is the number of edges leaving i.

Activity 4.3 The weighted, directed network in Figure 4.4 was derived from observing the
behavior of elephants over a two month period. The elephants observed were in the Chester Zoo,
which is located in the UK. Each node represents an elephant. A link drawn from elephant A to
elephant B means that A initiated play interaction with B. The weight on the link records the
number of times this occurred over the two month period.

Figure 4.4 Graphic from "The application of social network theory to animal behaviour" xxv
19

Construct a table which shows the indegree, outdegree, and total degree for each elephant. Based on your
table, tell which elephant is most active and speculate on the reasons why.

Answer. vertex Tunga Su Up Sit Ram Sheba Birma Thi Jangoli Maya
indegree 4 2 1 2 2 0 0 0 0 0
outdegree 4 2 1 1 3 0 0 0 0 0
total degree 8 4 2 3 5 0 0 0 0 0

Note that Tunga's indegree is 4, his outdegree is also 4, and his total degree is 8. He has the highest
indegree, outdegree, and total degree of all the elephants.

"Tunga is a popular member of the play network, being linked with every other in
this subgrouping, implying that he occupies a central position. The weights ...
indicate that Tunga is the initiator of much play activity, shown by the high values
associated with his outdegrees."xxvi

Note. For the remainder of this module we will restrict our attention to undirected graphs.
*****

4.2 Focus on edges

In the popular culture, the phrase "six degrees of separation" conveys the common belief that
every pair of living human beings is separated by no more than six steps; that is, if we were to
draw an acquaintance network of all living human beings, then "anyone on the planet can be
connected to any other person on the planet through a chain of acquaintances that has no more
than five intermediaries."xxvii We now consider how to define the distance between two vertices
in a graph.

A path between two nodes in a graph is a sequence of vertices (no repeats) π : v 1 , v 2 , ... , v k ,
where consecutive pairs of vertices are edges in the graph. The length of the path is k-1.

A shortest path between two nodes in a graph is a path of shortest length between them. Its
length is called the geodesic distance between the vertices.
20

Example 4.4 Consider the following network.

Figure 4.5 Finding shortest paths

The shortest paths between each pair of nodes are listed in Table 4.1 below.

Find the shortest path(s) between


each pair of nodes
The shortest path from 1 to 2 is π : 1, 3, 2
The shortest path from 1 to 3 is π : 1,3
The shortest path from 1 to 4 is π : 1,3,4
The shortest path from 1 to 5 is π : 1,3,4,5
The shortest path from 1 to 6 is π : 1,3,4,6
pat from 2 to 3 is π : 2,3
The shortest path
The shortest path from 2 to 4 is π : 2,3,4
The shortest path from 2 to 5 is π : 2,3,4,5
The shortest path from 2 to 6 is π : 2,3,4,6
The shortest path from 3 to 4 is π : 3,4
The shortest path from 3 to 5 is π : 3,4,5
The shortest path from 3 to 6 is π : 3,4,6
The shortest path from 4 to 5 is π : 4,5
The shortest path from 4 to 6 is π : 4,6
The shortest path from 5 to 6 is π : 5,6

Table 4.1 Enumeration of shortest paths

A topological measure associated with every edge in a graph is its edge betweenness
etweenness. As we will
see in Section 5,, edge betweenness is the basis for an important class of community detection
algorithms, divisive hierarchical clustering algorithms.
algorithms Edges with the highest betweenness in a
xxviii
network are "bridges." "[W]ithout the edge, paths between many pairs of nodes may have to
"[W]ithout
be “re-routed” a longer way."xxix

The betweenness of an edge is the number of shortest paths between pairs of vertices that pass
through that edge.xxx "For
For nodes A and B connected by a path, assume 1 unit of “flow
“flow.” If there
are k shortest paths between A and B, [d]iv
[d]ivide flow evenly along all possible shortest paths from
A to B - if k shortest paths from A and B, then 1/k units of flow pass along each."xxxi
21

Example 4.5 Consider the graph in Figure 4.5 (above). To find the betweenness of edge 3-4, we
follow these steps: (1) Find all the shortest paths between all node pairs in the graph. (2) For
each shortest path, check whether it passes through the edge 3-4; if affirmative, record the units
of traffic flow for this shortest path. (3) Last, sum up all the units of traffic flow to get the
betweenness of edge 3-4. Table 4.2 shows all of our steps. The betweenness of edge 3-4 is 9.

Find the shortest path(s) between Does it include edge {3, 4}? Units of traffic flow
each pair of nodes
The shortest path from 1 to 2 is π : 1, 3, 2 NO 0
The shortest path from 1 to 3 is π : 1,3 NO 0
The shortest path from 1 to 4 is π : 1,3,4 YES 1
The shortest path from 1 to 5 is π : 1,3,4,5 YES 1
The shortest path from 1 to 6 is π : 1,3,4,6 YES 1
The shortest path from 2 to 3 is π : 2,3 NO 0
The shortest path from 2 to 4 is π : 2,3,4 YES 1
The shortest path from 2 to 5 is π : 2,3,4,5 YES 1
The shortest path from 2 to 6 is π : 2,3,4,6 YES 1
The shortest path from 3 to 4 is π : 3,4 YES 1
The shortest path from 3 to 5 is π : 3,4,5 YES 1
The shortest path from 3 to 6 is π : 3,4,6 YES 1
The shortest path from 4 to 5 is π : 4,5 NO 0
The shortest path from 4 to 6 is π : 4,6 NO 0
The shortest path from 5 to 6 is π : 5,6 NO 0
Total flow = 9
Table 4.2 Calculating the betweenness of edge 3-4

An alternative way for analyzing the betweenness of edges is demonstrated in the presentation,
"Betweenness Measures and Graph Partitioning."xxxii We demonstrate this approach by using it
to compute the betweenness of edge 3-4, and then to compute the edge betweenness of the rest of
the edges in the graph in Figure 4.5.

Here is our analysis:


• Edge 3-4: each pair of nodes between [1-3] and [4-6]; each pair with traffic = 1;
total: 3 x 3 = 9
• Edge 1-3: each pair of nodes between [1] and [3] (no other); each pair with traffic = 1;
total: 1 x 1 = 1;
o similar for edges 2-3 and 5-6
• Edge 4-5: each pair of nodes between [1 - 4] and [5]; each pair with traffic = 1;
total 4 x 1 = 4
o similar for edge 4-6

We arrange the edge betweenness values we have found in the matrix in Figure 4.6 where the
22

(i,j)th entry is the edge betweenness for the edge i-j, for all i and j, where 1 ≤ i, j ≤ 6.

nodes 1 2 3 4 5 6
 1 0 0 1 0 0 0 

 2 0 0 1 0 0 0
 
 3 1 1 0 9 0 0
 4 0 0 9 0 4 4
 
 5 0 0 0 4 0 1
 6 0 0 0 4 1 0 

Figure 4.6 Edge Betweenness Table for Example 4.4

Activity 4.4 Find the betweenness score for each of the edges in the graph in Figure 4.7.

Figure 4.7 Finding the bridges in a networkxxxiii

Answer The following tablexxxiv gives the betweenness score for each of the edges in
Figure 4.7.
23

4.3 Global attributes of a graph: connectedness and density

In everyday usage, a community is a "social unit (a group of people) who have something in
common, such as norms, values, identity, and often a sense of place that is situated in a given
geographical area (e.g. a village, town, or neighborhood)."xxxv

In network theory, a community


ommunity is often described as a subgraph whose nodes are "tightly knit
groups with many connections between the group members and relatively few connections
between groups."xxxvi Alternatively, a community - also called a cluster or module - refers to a
"group of nodes more densely connected to each other than to nodes outside the group. ..."xxxvii

The words "density" and


nd "connected" have topological meaning which we now develop.

When is a graph connected?


A (simple, undirected) graph is connected if there is a path in the graph
aph between any pair of its
vertices;; otherwise, a graph is disconnected.

Example 4.6 In Figure 4.8 you see two connected graphs that have 4 vertices.

Figure 4.8 Two connected


c graphs on 4 vertices

Activity 4.5 Draw two connected graphs, each with 4 vertices, whose wiring diagrams are
different from the ones
es in Example 4.
4.8.

Answers may vary. Possible answers will be among the following:


24

Example 4.7 Here are two graphs, each with 4 vertices, that are disconnected.

Figure 4.9 Two disconnected graphs on 4 vertices

Activity 4.6 Draw two disconnected


connected graphs, each with 4 vertices, whose wiring diagrams are
different
rent from the ones in Example 4.9.

Answers may vary. Possible answers will be among the following:

Activity 4.7 The following graph is disconnected.


disconnected There
here is no path from a node in the triangle to
a node in the line. What is the degree sequence of this graph? Draw a connected graph with the
same degree.

Figure 4.10 Another disconnected graph on 4 vertices

Answer. The graph in Figure 4.10


4. has degree sequence 2, 2, 2, 1, 1. The following is a connected
graph with the same degree sequence:
sequence
.

Example 4.8 The graphs in Figure 4.11 both have degree sequence 3, 3, 2, 2, 2 but they are
wired differently. In the left-hand
hand graph, the two vertices of degree 3 are adjacent; however, the
two vertices of degree 3 are not adjacent.
adjacent
25

Figure 4.11 Two connected graphs with the same degree sequence but different structure

What is the density of a graph?


Informally, the density of a graph (i.e., a simple, undirected graph) measures how tightly knit the
vertices of a graph are to each other
other. Formally, the density of a graph G, δ(G), is the ratio of
the number of its links, L, to the theoretical maximum possible number of links it could have if
all the node pairs in G were linked. In a network
etwork with N nodes, the number of links
links, L, can vary
 N  N(N-1) N(N-1)
from 0 to   = . Adopting Barabási's notationxxxviii we let L max = and obtain the
2 2 2
following formula:

L 2L
(4) δ(G)= =
Lmax N(N-1)

The density of any graph is between 0 to 1 because 0 ≤ L ≤ L max . A graph with density 1,
where all node pairs are linked,
linked is called a complete graph or clique. The symbol K n is used to
denote the complete graph on n vertices. The graphs in Figure 4.12 are the complete graphs K n ,
where 1 ≤ n ≤ 7.

Figure 4.12 Examples of Complete Graphsxxxix

Note. While the definition of the density


d of a graph is unambiguous, the phrase dense graph is
26

ill-defined. For example, consider the following description of a dense graph (the italics are
ours): "A dense graph is a graph whose number of edges is close to the maximal number of
edges. The opposite, a graph with only a few edges, is a sparse graph. The distinction between
sparse and dense graphs is rather vague, and depends on the context."xl

Example 4.9 Figure 4.13 displays three graphs, all with 5 vertices but of varying density. The
graphs, from left to right, have the following density values: 1, 0.8, 0.5.

Figure 4.13 Graphs on five vertices of different density

Activity 4.8 Find the densities for the following "Star", "Ring", and "Line" networks.

Figure 4.14 The "Star", "Ring", and "Line" Networksxli

Answer "Star" and "Line" have density 6/21 while "Ring" has density 7/21.

5. Measuring the Quality of a Community Structure

We recall from Section 2 that M.E.J. Newman's definition of as "groups of vertices such that
there are many edges within groups and few edges between groups" is vague. As Newman points
out, what we mean by "many" or "few" is open to question.xlii

As there is no single definition of "community," this means that the community detection
problem is ill-defined. "There are no universal protocols on the fundamental ingredients, like the
definition of community itself, nor on other crucial issues, like the validation of algorithms and
the comparison of their performances."xliii Different algorithms performed on the same network
can produce different clusters of nodes, depending on their sets of assumptions.
27

5.1 Interconnectivity and intraconnectivity density measures for subgraphs

In order to make precise the intuition that communities in a network are subgraphs whose
vertices are more densely connected to each other than to the rest of the graph, metrics have been
proposed for quantifying the interconnectivity of groups of nodes, as well their intraconnectivity
to other nodes in a network. We present measures that are used by Santo Fortunato, Professor of
Complex Systems in the School of Informatics and Computing at Indiana University.

For any subgraph, C, which has n C nodes and L C links, Fortunato xliv defines two density-related
metrics:

• The intra-cluster density of Cxlv, δint (C) , is given by:


LC
δint (C) = .
n C (n C -1) / 2
• Similarly, the inter-cluster density of Cxlvi, δ ext (C) , is given by:

# of inter - cluster edges of C


δext (C) = .
n C (n C -1)

Fortunato describes their value for community detection (the italics are ours): "For C to be a
community, we expect δ int (C ) , to be appreciably higher than δ (G). On the other hand, δ ext (C )
"has to be much smaller than δ int (C ) . Searching for the best tradeoff between a large δ int (C ) , and
a small δ ext (C ) is implicitly or explicitly the goals of most clustering algorithms."

Again, the terms "appreciably greater than" and "much smaller than" and so these metrics are
indistinct, like the notion of community, are indistinct.

Example 5.1 For the network in Figure 5.1 (below), we look to see if the indicated subgraphs
satisfy Fortunato's expectations for communities. We note the graph has N = 6 nodes and L = 6
6 6
edges. Also, the density of G is δ (G)= = = 0.40.
  15
6
 
2
For each subgraph, we carry out the relevant calculations and comparisons of δ (G), δ int (C ), and
δext (C) and list them in Table 5.1.While the inequalities δ int (C ) > δ (G) and δ int (C ) > δ ext (C ) are in
the right direction for both subgraphs, the vagueness of the meaning of "appreciably greater
than" prevents any further conclusions.
28

Figure 5.1 Are the indicated subgraphs communities?

Subgraph C nC LC # of inter
inter- δint (C ) δ ext (C ) δint (C ) >> δ (G)? δ int (C ) >> δ ext (C )?
cluster edges
Red 3 2 1 0.667 0.1667 0.667 >> 0.40 0.667 >> 0.1667?
Blue 3 3 1 1.00 0.1667 0.67 >> 0.40? 1.00 >> 0.1667?

Table 5.1 The intraconnectivity and interconnectivity of a cluster

Activity 5.1 The network in Figure 5.2 appears in a paper by M.E.J. Newman."xlvii Check
whether the three subgraphs (blue, green, red) meet Fortunato's expectations for a community by
computing and comparing δ int (C ), δ ext (C ), and δ(G).

Figure 5.2 Community structurexlviii


29

27 27
Answer =
G has parameters N = 15 and L = 27. Its density is δ(G)= = 0.2571.
 15  105
 
2
For each subgraph, we carry out the relevant calculations and comparisons of δ(G), δint (C ), and
δext (C) and list them in the table below.While the inequalities δint (C ) > δ(G) and δint (C)>δext (C)
are in the right direction for all three subgraphs, the ambiguity of the phrase "appreciably greater
than" makes it impossible to draw further conclusions.

Subgraph C nC LC # of inter- δint (C ) δ ext (C ) δint (C ) >> δ (G)? δint (C ) >> δ ext (C )?
cluster edges
Blue 4 5 3 0.5 0.25 0.5 >> 0.2571? 0.5 >> 0.25?
Green 5 8 2 0.8 0.10 0.8 >> 0.2571? 0.8 >> 0.10?
Red 6 10 3 0.67 0.10 0.67 >> 0.2571? 0.67 >> 0.10?

5.2 For a given partition of a network into a family of clusters, how do we measure the
modularity of the partition?

One of the most commonly used optimization methods in community detection is modularity
maximization. In Section 6, we will present two community detection methods that are based on
optimizing the modularity of a network.

Modularity measures the tendency of a network to subdivide into modules or communities.


According to Wikipedia, "Modularity is the fraction of the edges that fall within the given groups
[i.e., the modules] minus the expected fraction if edges were distributed at random."xlix

To derive an expression for the expected fraction of edges in the random model , we turn to the
concept of a Configuration Model.

According to Wikipedia:

"The configuration model is a randomized realization of a particular network. Given a


network with [N] nodes, where each node [i] has a node degree [k i ] the configuration
model cuts each edge into two halves, and then each half edge, called a stub, is rewired
randomly with any other stub in the network even allowing self loops. Thus, even though
the node degree distribution of the graph remains intact, the configuration model results
in a completely random network."l

To calculate the probability of there being an edge that connects node i to node j in the random
model, we follow the common practice of introducing the notion of "stubs" or half-edges. Under
30

the Configuration Model, for a particular stub leaving node i, there are 2L-1 possible stubs to
which it might connect; and, for any node j, the number of possible stubs from node j to which a
stub from node i could connect is k j . Therefore, the probability that a stub from node i connects
kj kj
to a stub of j is p j = ≈ , if L is large. So, the expected number of full edges between
2L − 1 2L
ki k j
node i to node j, p ij , is given by: pij =
, if L is large. Now, in the actual network (assumed
2L
simple and connected), the actual number of edges from node i to node j is given by aij .
Therefore, the fraction of the edges that fall within the given groups [i.e., the modules or
communities] minus the expected fraction if edges were distributed at random" is calculated by
finding the following difference:
aij − pij aij pij
= − , for each pair of nodes i and j in the group.
2L 2L 2L

In his book, Network Science, Albert-László Barabási follows the Configuration Model. For a
given partition of a network into a set of n c communities and for each community C c in the
partition, he defines the modularity of C c (where 1 ≤ c ≤ n c ), denoted M c , and the
modularity, denoted M, of the overall network that has been so partitioned as follows:

1
Mc = ∑ ( Aij − pij )
2 L (i , j )∈Cc
(9.9)

where (i, j) means that the nodes i and j belong to the same community and where
ki k j
pi j = . (9.10)
2L

Barabási observes:

"If M c > 0, then the subgraph C c has more links than expected by chance, so it
represents a potential community. If M c = 0, then the connectivity between the N c nodes
is random, fully described by the degree distribution. Finally, if M c < 0, then the nodes
of C c do not form a community. ..."li

He adds:

"Using (9.10) we can derive a simpler form for the modularity (9.9) (Advanced topics 9.B)
2
L k 
Mc = c −  c  , (9.11)
L  2L 
where L c is the total number of links within the community C c and k c is the total degree of the
31

nodes in this community."lii

Finally, Barabási extends these ideas to the whole network:

"[W]e define the partition's modularity by summing (9.11) over all n c communities:
nc  L  k 2 
M = ∑  c −  c  . (9.12)"liii
 L  2 L  
c =1 

*****

We apply Barabási's definition of the modularity of a community structure in his formula (9.9) in
the next example.

Example 5.2 Suppose that G is the ring network shown in Figure 5.3. What is the modularity
score, M, for the community structure indicated below where nodes 1, 4, and 3 are in the Blue
community and nodes 2 and 5 are in the Red community?

Using Barabási's formula in (9.9), we will calculate the modularity scores for the red community
and the blue community and add them up to obtain the overall modularity, M, of the whole
network. Then we will compute the network modularity M by using Barabási's formula (9.12).

Figure 5.3 Find the modularity score, M, for this community structure

2*2
We note that because every node has degree 2, then pij = , for all i and j, 1 ≤ i, j ≤ 5.
2*5
Next, we find the adjacency matrix for G:

node 1 2 3 4 5
 1 0 1 1 0 0 

 2 1 0 0 0 1
[ A ij ] =  .
 3 1 0 0 1 0
 4 0 0 1 0 1
 
 5 0 1 0 1 0 
32

To make the blue and red communities visible, we reorder the rows and columns of the
adjacency matrix as follows:

node 1 3 4 2 5
 1 0 1 0 1 0
 
 3 1 0 1 0 0
[ A ij ] =  
 4 0 1 0 0 1
 2 1 0 0 0 1
 
 5 0 0 1 1 0 

Now we compute the modularity for the blue and red communities.

To calculate the modularity of the blue community whose nodes are {1,3, 4}, go through the
terms in the adjacency matrix [ Aij ] which correspond to the blue node pairs; those terms are
highlighted in blue. There are 9 blue node pairs, so there are there are 9 terms, Aij − pij , that are
multiplied by 1 and included in the modularity score. Note that for node pairs where not both
nodes are blue, the difference Aij − pij is multiplied by 0 and so is not included in the
modularity. So, the modularity score for the blue community is:

1
M Blue = {[( A11 − p11 ) + ( A13 − p13 ) + ( A14 − p14 )] + [( A31 − p31 ) + ( A33 − p33 ) + ( A34 − p34 )] +
2*5
[( A41 − p41 ) + ( A43 − p43 ) + ( A44 − p44 )]}
1
= {[(0 − p11 ) + (1 − p13 ) + (0 − p14 )] + [(1 − p31 ) + (0 − p33 ) + (1 − p34 )] +
2*5
[(0 − p41 ) + (1 − p43 ) + (0 − p44 )]}
1 2* 2 2* 2 2* 2 2* 2 2* 2 2* 2
= {[(0 − ) + (1 − ) + (0 − )] + [(1 − ) + (0 − ) + (1 − )] +
2*5 2*5 2*5 2*5 2*5 2*5 2*5
2* 2 2* 2 2* 2
[(0 − ) + (1 − ) + (0 − )]}
2* 5 2*5 2* 5

1 4 4
= {5(0 − ) + 4(1 − )}
2* 5 10 10

= 0.04

Next, we calculate M Re d .
33

1
M Re d = {[( A22 − p22 ) + ( A25 − p25 )] + ( A52 − p52 )] + [( A55 − p55 )]}
2 *5
1
= {[(0 − p22 ) + (1 − p25 )] + (1 − p52 )] + [(0 − p55 )]}
2 *5
1 2*2 2*2 2*2 2*2
= {[(0 − ) + (1 − )] + (1 − )] + [(0 − )]}
2 *5 2*5 2*5 2*5 2*5
1 2* 2 2*2
= {2(0 − ) + 2(1 − )}
2 *5 2 *5 2 *5
1 2* 2 2*2
= {2(0 − ) + 2(1 − )}
2 *5 2 *5 2 *5
1 2* 2 2* 2
= {2(0 − ) + 2(1 − )}
2*5 2*5 2 *5
= 0.04

Therefore, the total modularity of the community structure is calculated as follows:


M = M Blue + M Red = 0.04 + 0.04 = 0.08.

2
L k 
Next, we recalculate M Re d and M Blue by using the formula (9.11), M c = c −  c  .
L  2L 
We observe that L = 5, LBlue = 2 , kBlue = 6, LRed = 1 , kRed = 4. So:

2 2
LRed  k Re d  1  4 
M Red = −  = −  = 0.04, and
L  2 L  5  2*5 
2 2
LBlue  k Blue  2  6 
M Blue = −  = −  = 0.04.
L  2L  5  2*5 

Thus, M = M Blue + M Red = 0.04 + 0.04 = 0.08.

Activity 5.2 Give the modularity scores for the community structures shown in Figure 5.4.

Figure 5.4 Comparing two community structuresliv


34

Answers For the left-hand community structure: M = 5/14 = 0.357


For the right-hand community structure: M = 6/49 = 0.122

6. Community Detection with Hierarchical Clustering Algorithms

As we have noted in this module, no consensus has yet developed around a formal definition of
community. Accordingly, a variety of community detection methods have been developed (e.g.,
clique-based, graph partitioning, and modularity maximizationlv detection methods) in order to
meet the assumptions of different definitions. It is important to note that the brute-force approach
of finding all possible partitions of a network and choosing the one that best fits a particular
definition of community is not feasible as it is an NP-hard problem. Using brute force to solve
even the simplest community-finding problem, graph bisection, is NP-hard.lvi As Barabási states,
"We therefore need algorithms that can identify communities without inspecting all partitions."lvii

In this section we will describe some of the most widely used algorithms, called hierarchical
clustering methods, whose goal is to maximize modularity.

6.1 Community Detection with Agglomerative Hierarchical Clustering Algorithms

Agglomerative clustering or Agglomerative Nesting (AGNES) is a clustering algorithm that


constructs a hierarchy of clusters from the nodes of a given network. Initially each node is
assigned to its own cluster. Then two nearest clusters are merged into the same cluster. This
process is repeated until only one cluster is left.

As the word nearest indicates, AGNES is based on distance. Carrying out hierarchical clustering
requires two decisions: (1) the choice of a distance function, d, for measuring the distance d(i, j)
between any two nodes, i and j, in the network; and (2) the decision of how to extend the
distance function, d, to define the distance d(A,B) between any two clusters of nodes, A and B.

There are several options for (1), defining the distance between two nodes. These options access
the topological structure of the network through the degrees of the nodes, k i (1 ≤ i ≤ N), the
1 N
means of the rows in the adjacent matrix, < A i > = ∑ A , and the number of neighbors that
N k =1 ik
two nodes i and j share in common, nij = ∑ Aik Akj (the dot product of rows A i and A j in the
k
adjacency matrix); a neighbor of a node is a node to which it is adjacent. Three commonly used
ways to measure the distance between two nodes i and j in an undirected network are as follows:
35

• the (squared) Euclidean distance, dij , defined by:


N
(4) dij = ∑ (Aik − Ajk )
k =1
2

• the cosine similarity measure, σ ij , which uses the dot product of the vectors formed by
rows i and j in the adjacency matrix
N

∑ Aik Akj
(5) σ ij = k =1
N N

∑ A ik ∑ A jk
k =1
2

k =1
2

• the standard Pearson correlation coefficient, rij , which use the variance of rows i and j
in the adjacency matrix:
N

cov (Ai , Aj ) ∑ (Aik − < Ai >)(Ajk − < Aj >)


k =1
(6) rij = =
σ iσ j N N

∑ (Aik − < Ai >)2


k =1
∑ (Ajk − < Aj >)
k =1
2

where cov( Ai , Aj ) denotes the covariance of rows i and j in the adjacency matrix

Very important, each of these metrics can be expressed succinctly in a network-centric way:

• squared Euclidean distance:


(7) dij = ki + k j − 2nij
• the cosine similarity measure:
nij nij
(8) σ ij = =
ki k j ki k j
• the standard Pearson correlation coefficient:
kk
nij − i j
(9) rij = N .
ki 2 k j2
ki − kj −
N N

Once a distance metric, d, for nodes has been selected, we define the N x N matrix D = [d(i , j)]
to be the distance matrix for the network.

For step (2), i.e., selecting a way to extend d to define the distance d(A,B) between two clusters,
A and B, there are several options, such as:
• minimum or single-linkage clustering, where d(A, B) = min{d(a,b) | a ∈ A, b ∈ B};
36

• maximum or complete-linkage clustering, where d(A, B) = max{d(a,b) | a ∈ A, b ∈ B};


1
• mean or average linkage clustering, where d(A, B) = ∑ ∑ d(a,b).
| A||B|a∈A b∈B

The sequential steps in hierarchical clustering can be displayed in a nested diagram that shows
the sequential merging of clusters, as well as by a tree called a dendrogram.

We illustrate hierarchical clustering in Example 6.1.

Example 6.1 Consider a graph we saw earlier in Figure 3.2:

We will carry out single-linkage, hierarchical clustering using the squared Euclidean distance
formula (7), dij = ki + k j − 2nij , for measuring the distance between nodes i and j.

The degrees of the five nodes are: k 1 = 2, k 2 = 4, k 3 = 2, k 4 = 3, k 5 = 3.

Next, we arrange the values n ij (i.e., the number of neighbors of nodes i and j) are as follows:

2 1 1 2 1
1 4 1 2 2 

[nij ] = 1 1 2 1 2
 
2 2 1 3 1
1 2 2 1 3

Applying the distance formula dij = ki + k j − 2nij , for all i and j, 1 ≤ i , j ≤ 5,


we get the following distance matrix :
37

0 4 2 1 3
4 0 4 3 3 

D = [d ] =  2 4 0 3 1 .
ij
 
1 3 3 0 4
 3 3 1 4 0 

Now we carry out hierarchical clustering.

i. Initially we have 5 clusters: 1, 2, 3, 4, 5. The distance between each pair of clusters is


given by the distance matrix. The minimum distance between clusters is 1.

clusters 1 2 3 4 5
 1 0 4 2 1 3
 
 2 4 0 4 3 3
 
 3 2 4 0 3 1
 4 1 3 3 0 4
 
 5 3 3 1 4 0 

ii. We merge cluster 1 and cluster 4 into one cluster {1,4} at a distance of 1; we also merge cluster
3 and cluster 5 into one cluster {3,5} at a distance of one. The minimum distance between the
clusters is 2.

clusters {1,4} {3,5} 2 


 {1,4} 0 2 3
 
 {3,5} 2 0 3
 
 2 3 3 0

iii. We merge cluster {1,4} and cluster {3,5} into cluster {{1,4}, {3,5}} at a distance of 2.
The minimum distance between clusters is 3.

 clusters {{1,4} ,{3,5}} 2


 
{{1,4} {3,5}} 0 3
 2 3 0 

iv. We merge cluster {{1,4},{3,5}} and cluster 2 into one cluster {{{1,4},{3,5}},2} at a
distance of 2. The last cluster contains all the objects, thus the agglomerative algorithm
is concluded.
38

 clusters {{{1,4} ,{3,5}} ,2} 


 
{{{1,4} ,{3,5}} ,2} 0 

In Figure 6.1 we show two visualizations of the sequential steps that we carried out in
hierarchical clustering. In the dendrogram, the nodes of the networks are arranged along the x-
axis. The y-axis scale represents distances between the clusters.

Figure 6.1 Visualizations of agglomerative hierarchical clustering

Note. One way to subdivide a network into clusters is to choose a height along the y-axis of the
dendrogram, and draw a horizontal line at that height (see the red line in the above dendrogram),
and count how many vertical lines it crosses. Each vertical line corresponds to a grouping of
clusters, and the members of that grouping are represented by the branches of the dendrogram
that lie below the line. In the above dendrogram, the red horizontal line intersects two vertical
lines. The associated community structure is the set of two clusters: A = {1,4,3,5} and B = {2}.
*****

In general, how can we use the dendrogram to determine how many clusters best fit our data?

Because the distance metric we choose is intended to intended to measure the similarity of nodes,
"The best choice of the no. of clusters is the no. of vertical lines in the dendrogram cut by a
horizontal line that can transverse the maximum distance vertically without intersecting a
cluster."lviii

To illustrate, in the dendrogram shown in Figure 6.2 below, the distance between the two red
lines indicates the difference between the clusters; the longer the line, the greater the difference.
39

These lines cut four vertical lines. For each vertical line, the nodes beneath it form a community.
So, there are four communities: {9,23,17,6,11,3,15,7}, {14,19,16,24,22},
{1,13,12,5,21,4,10,20}, {2,18,8,25}.

Figure 6.2 Using the dendrogram to determine community structurelix

Activity 6.1 Consider the following distance matrix for a network of 5 nodes 1, 2, 3, 4, 5.

distance 1 2 3 4 5 
 1 0 0.10 0.90 0.35 0.80 

 2 0.10 0 0.30 0.40 0.50 
 
 3 0.90 0.30 0 0.60 0.70 
 4 0.35 0.40 0.60 0 0.20 
 
 5 0.80 0.50 0.70 0.20 0 

Using minimum distance (single-linkage), give the dendrogram and nested diagram that result
from carrying out agglomerative hierarchical clustering.

Answer We begin by assigning each of the five nodes to its individual cluster.

Step 1 The minimum distance in the table is the distance between nodes 1 and 2.
This distance is 0.10. So we group 1 and 2.
40

cluster 1 2 3 4 5 

0
.
1
0
 1 0 0.90 0.35 0.80 
 
 2 0.10 0 0.30 0.40 0.50 
 
 3 0.90 0.30 0 0.60 0.70 
 4 0.35 0.40 0.60 0 0.20 
 
 5 0.80 0.50 0.70 0.20 0 

Step 2 The minimum distance in the new table is the distance between nodes 4 and 5.
The distance if 0.20. We now group 4 and 5.

cluster 1,2 3 4 5 
 1,2 0 0.30 0.35 0.50 

 3 0.30 0 0.60 0.70 
0
.
2
0

 
 4 0.35 0.60 0 
 5 0.50 0.70 0.20 0 

Step 3 The minimum distance in the new table is the distance between {1,2} and 3.
The distance is 0.30. We now cluster 3 with {1,2}.

cluster 1,2 3 4,5 


0
.
3
0

 1,2 0 0.35
 
 3 0.30 0 0.60 
 
 4,5 0.35 0.60 0 

Step 4 The minimum distance in the new table is the distance between {{1,2},3} and {4,5}.
The distance is 0.35. We now cluster {{1,2},3} with {4,5}.

cluster 1,2,3 4,5 


0
.
3
5

 1,2,3 0 
0
.
3
5

 
 4,5 0 

Step 5 The minimum distance in the new table is 0. We have one cluster. The process is
complete.
 cluster {{{1,2},3},{4,5}} 
{{{1,2},3},{4,5}} 0 
 
41

Here is another way to summarize our steps:


i. Initially we have 5 clusters: 1,2,3,4,5.
ii. We merge cluster 1 and cluster 2 into cluster {1,2} at distance 0.10.
iii. We merge cluster 4 and cluster
clu 5 into cluster {4,5} at distance 0.20.
iv. Wee merge cluster 3 and cluster {1,2} into cluster {{1,2},3} at distance 0.30.
v. We merge cluster {{1,2},3} with cluster {4,5} into cluster {{1,2},3}, [4,5}} at a
distance of 0.35.
vi. The last cluster contains all the object, thus the agglomerative algorithm is
concluded.

The dendrogram and nested diagrams below visualize our results. The dendrogram is drawn
based on the distances to merge the clusters.

6.2 Community Detection with Divisive Hierarchical Clustering Algorithms

Agglomerative nesting algorithms are 'bottom-up.' They start with each node of the network
assigned to its own cluster.. Next, two nearest clusters are merged into the same cluster.
c This
process is iterated until only one cluster is left.

Divisive analysis clustering (DIANA) algorithms are the reverse of agglomerative nesting
algorithms. They are 'top-down
down.' Initially all the nodes in a network are placed in a single cluster.
Next, based
ased on the intuition that edges with the highest betweenness values are 'bridges' between
communities, groups are subdivided into two group recursively until each node of the network
forms a separate cluster of its own.

Example 6.2 Using the edge betweenness


betwee of each edge, we will apply the divisive hierarchical
clustering algorithm to the graph below which appeared earlier in Figure 4.5.
4.5
42

In Example 4.5,, we calculated the edge betweenness scores for every edge in the network in
Figure 4.5 (displayed above) and summarized them in the following betweenness table (the
j, for all i and j, where 1 ≤ i,
number in the (i,j)th position is the edge betweenness for the edge ii-j,
j ≤ 6):

nodes 1 2 3 4 5 6
 1 0 0 1 0 0 0
 
 2 0 0 1 0 0 0
 
 3 1 1 0 9 0 0
 4 0 0 9 0 4 4
 
 5 0 0 0 4 0 1
 6 0 0 0 4 1 0 

Step 1. Since the edge 3-44 has the highest betweenness score,
score we remove it from the network.
Here is the new network that results:

Step 2. We re-compute
compute the betweenness score
scoress for the edges in the new network
network.

Here is our analysis:


• Edge 1-3: shortest path for each pair of nodes between [1] and [2, 3] (no other); each pair
with traffic = 1; total: 2 x 1 = 2
43

o similar for edge 2-3


• Edge 4-5: shortest path for each pair of nodes between [4] and [5] (no other); each pair
with traffic = 1; total: 1 x 1 = 1
o similar for edges 4-6, 5-6

Here is the betweenness table for the new network:

nodes 1 2 3 64 5
 1 0 0 2 0 
0 0

 2 0 0 2 00 0
 
 3 2 2 0 00 0
 4 0 0 0 0 1 1
 
 5 0 0 0 1 0 1
 6 0 0 0 1 1 0 

Edges 1-3 and 2-3 have the highest betweenness score, so we remove them from the network.
Here is the network that results:

Step 3. We re-compute the betweenness scores for the edges in the new network.

Here is our analysis:


• Edge 4-5: shortest path for each pair of nodes between [4] and [5] (no other); each pair
with traffic = 1; total: 1 x 1 = 1
o similar for edges 4-6, 5-6
44

Here is the betweenness table for the new network:

nodes 1 2 3 4 5 6
 1 0 0 0 0 0 0 

 2 0 0 0 0 0 0
 
 3 0 0 0 0 0 0
 4 0 0 0 0 1 1
 
 5 0 0 0 1 0 1
 6 0 0 0 1 1 0 

Edges 4-5, 4-6, 5-6 have the highest betweenness score, so we remove them from the network.
Here is the network that results:

The network has been completely subdivided and the algorithm is complete.

Activity 6.2. Consider the following network, called the "Line" network:

a. On the above network, label each edge with its betweenness.


b. Remove the edge(s) of highest betweenness and draw the new network
that results.
c. Calculate the modularity, M, of the community structure for the network in part b.
c =1 2
L k 
Use the formulas M = ∑ M c , where M c = c −  c  , for each community c.
c =1 L  2L 
d. Label each edge in your network in part b with its edge betweenness.
e. Remove the edge(s) of highest betweenness and in the space below draw the new
network that results.
45

f. Calculate the modularity, M, of the community structure for the network in part e.
nc 2
Lc  kc 
Again, use A-LBarabási's formulas M = ∑ M c , where M c = −  (1 ≤ c ≤ n c ).
c =1 L  2L 
Answers

part a

part b

part c

As indicated below, there are 3 communities: red = {G, F, E}, blue = {C, B, A}, and green = {D}.

2 2
LRed k  2  5  23
M Red = −  Re d  = −   = , and
L  2L  6  2 * 6  144
2 2
LBlue k  2  5  23
M Blue = −  Blue  = −   = , and
L  2L  6  2*6  144
2
−4
2
L k  0  2 
M Green = Green −  Green  = −   =
L  2L  6  2*6  144

Therefore, the modularity of this community structure is:


c =3
 23  4 7
M = ∑ Mc = 2 − = = 0.29.
c =1  144  144 24

part d
46

part e

part f

As indicated below, there are 7 communities: {G}, {F}, {E}, {C}, {B},{ A}, and {D}.
Therefore, the modularity of this community structure is:

c =7  0   1  2   0   2  2  11
M = ∑ M c = 2    −    + 5   −    = − = −0.1528.
c =1  6   12    6   12   72
47

7. Technology Tutorial

7.1 Introduction to Netlytic


We have explored examples of community detection using small data sets. We want to use the
same tools for large data sets, however doing this by hand will take a tremendous amount of
time. Lucky for us, computers are exceptional at processing large amounts of data quickly. We
will be learning to use the free online software, Netlytic, to assist us with large data sets.

7.2 Creating a Netlytic Account


Here is a free video created by Netlytic that is a tutorial of how to setup an account with Netlytic.
https://youtu.be/DgpMDDU2BGg
To create an account from scratch:
1. Using an internet browser, type in Netlytic.org in the address bar.
2. Click Register
3. Type in a valid email address in the email address field and the confirm email address
field
4. Create a password and type your password into both the password field and the confirm
password field.
5. Verify the words or phrase in the text box.
6. Review the terms of user agreement.
7. Check the agreement box.
8. Click the register button.

To create an account from an existing Gmail or Yahoo email account:


1. Using an internet browser, type in Netlytic.org in the address bar.
2. Click on the appropriate button for either Gmail or Yahoo.
3. Enter in your email address and password.

Note: Netlytic does not access or store your personal email password. Netlytic simply asks
Gmail or Yahoo to verify your identity.

7.3 Basic Activity to See What Netlytic Can Create


We are going to walk through a process of using Netlytic to process data received from Twitter
and visualizing the data into communities based on retweets. This particular data set focuses on
tweets using #Ebola during the ebola outbreak in West Africa on November 12th, 2014. You can
either review Netlytic’s tutorial here, or follow the steps listed below.
1. Download the Ebola data set from here.
2. Using an internet browser, type in Netlytic.org in the address bar.
48

3. Sign in using your login information from before.


4. Click on the “New Dataset” button at the top of the page.
5. Click on the tab titled “Text File.”
6. The first field is the title that you will give this data set in Netlytic. Type Ebola into this
field.
7. To import the file we downloaded earlier click on the button called “Choose File” located
under the title field.
8. A new window will open where you will need to find the file you downloaded earlier. On
most computers there is a download folder that the file will have been downloaded into.
The file is named “Ebola_1K_Sample_Nov12_2014 - Sheet 1.csv” Click on this file and
click the “Open” button.
9. Click on the import button.
10. The next screen should show that you have successfully uploaded 1000 records into
Netlytic. Click on the “next step” button.
11. Click on the “4. Network Analysis” link
12. In the first box titled “NAME NETWORK / WHO MENTIONS WHOM” we need to
change the dataset type to Twitter, since the data was collected from Twitter. Click on the
drop down menu under “select data set” and click on Twitter.
13. Click on the button that says “1000 Remaining Posts”
14. Netlytic will now be processing your data which may take a few seconds. Once this is
done you will have new options at the top of this box. Click on the Visualize button.
15. After a few seconds a new window will open with a visual representation of the data from
Twitter.

The nodes are tweets that were made. The edges represent when a tweet was retweeted. The
different colors represent different communities retweeting different messages. You might notice
a large community in the middle of the data with the center node being “rt_com”. This is a news
organization, RT, that had tweeted a story about nurses in the United States protesting about a
lack of protection in place to safeguard against ebola. Here is a link to the original tweet. Here is
a link to the news article on RT’s website.

7.4 Discussion Questions


Have the students think about and write responses to these questions on their own. After the
students have had a sufficient amount of time to grapple with some of the questions have the
students discuss their thoughts in small groups. Once the small groups have been able to discuss
the questions for some time open the discussion up to the class as a whole. Identify common
themes that arise from the discussion.
1. Describe the sentiment of the tweets regarding this topic. How would you categorize the
public’s view on the issue?
2. Netlytic has identified 5 communities that are named Cluster 1 through Cluster 5. Do you
agree that these clusters are communities?
3. Why do you think some tweets have created communities where others have not?
49

4. Is public consensus in agreement on this issue? Give some evidence as to why or why
not.
5. Think about different groups, politicians, economists, the military, etc. How could these
groups use the information presented here?

7.5 Additional Resources


The exercises and examples for this module all use data that Netlytic has already generated. If
you would like to generate your own data, or have students generate their own data, Netlytic has
tools for doing so. This is outside of the scope for this module so we refer the instructor to the
tutorials provided by Netlytic here as well as the Netlytic Twitter Documentation here.

7.6 Homework Exercises


For homework we will be repeating the exercise performed in class, but on a few different data
sets. Students should download the data sets, upload the data into Netlytic, process the data, view
the visualization, and answer the discussion questions from before about each data set. Students
should come to class prepared to discuss their thoughts on the discussion questions for the data
sets.

Data Set 1: A set of tweets from participants of a Massively Online Open Course called CCK11.
This data set was posted to Netlytic by the pLASMA research team. The data can be downloaded
here.

Data Set 2: A set of tweets regarding a conference about social media called Social Media and
Society in 2015. The data can be downloaded here.
50

8. Suggested Projects
● Twitter posts regarding the discovery of a new particle with properties of the Higgs
boson. The data is from Twitter between July 1st and July 7th, 2012. Students can
analyze the spreading of the news of the discovery. Here is a link to the data set and some
metadata information. The data is hosted by Jure Leskovec at Stanford University and
collected by [1].
● A map of Twitter followership and friendship. This data is a directed graph showing
which Twitter users follow which users. This project can head in numerous different
directions regarding communities, such as where are people getting news via Twitter,
who is popular on Twitter, or which celebrities have the greatest following on Twitter.
Here is a link to the data set and some metadata information. The data is hosted by ASU
and collected by [5].
● We have presented two hierarchical clustering algorithms in this module, the
agglomerative and divisive algorithms. There are numerous other clustering algorithms.
Students could create a project where there independently learn about and present a
different clustering algorithm. This should include a discussion of the advantages and
disadvantages between the new algorithm and the ones presented in this module. The
project should also include worked out examples on some sample networks.
● The community detection algorithms presented in this module use a modularity method
for detecting communities in networks. There are other methods for detecting
communities in networks, for example a betweenness-based algorithm for community
detection. A student could independently explore a different method for detecting
communities. This should include a discussion of the advantages and disadvantages
between the new algorithm and the ones presented in this module. The project should
also include worked out examples on some sample networks.
● In the agglomerative hierarchical clustering method described in this module we have
defined one possible way of defining distance. There are other possible ways to define
distance. A student project could be exploring complete-linkage clustering and average
linkage clustering as different ways to define distance in this method. The student should
consider an example network and compare the communities found using the three
different distances. Included in this should be an analysis of the differences in the
communities found. There should also be an analysis of the advantages and
disadvantages of each of the different distances used.

UCI offers a repository of data sets of numerous different networks here. If you are looking for
more project ideas you can examine the different data sets in the repository.
51

9. Endnotes

i
https://www.kickstarter.com/about?ref=nav
ii
https://www.kickstarter.com/help/faq/kickstarter%20basics
iii
https://www.kickstarter.com/projects/1152620801/swale
iv
http://www.forbes.com/sites/jacobmorgan/2014/05/13/simple-explanation-internet-things-that-anyone-can-
understand/#5de624a86828
v
PowerPoint presentation, "Army Cyber Institute Overview", by MAJ Natalie Vanatta, Deputy Chief of Research,
Army Cybersecurity Institute, United States Military Academy, West Point. Delivered Thursday, June 16, 2016, at
the CCICADA Reconnect Workshop at West Point.
vi
Visualizing Social Networks II: Facebook.
http://bellm.org/blog/2011/06/08/visualizing-social-networks-ii-facebook/
vii
Ibid
viii
p.357. Networks: An Introduction. M .E. J. Newman. Oxford University Press (2010)
ix
Ibid
x
Ibid
xi
Ibid
xii
http://www.maths.tcd.ie/~mnl/store/AdamicGlance2004a.pdf
xiii
"The Political Blogosphere and the 2004 U. S. Presidential Election: Divided They Blog." Lada Adamic and
Natalie Glance. http://www.maths.tcd.ie/~mnl/store/AdamicGlance2004a.pdf
xiv
Figure 1, Ibid
xv
Chapter 9, Network Science. Albert-László Barabási. Cambridge University Press.(2016)
xvi
https://www.quantamagazine.org/wp-content/uploads/2015/01/HDN_Quanta.jpg
xvii
http://www.pnas.org/content/104/21/8685/F2.expansion.html
xviii
An Information Flow Model for Conflict and Fission in Small Groups. Wayne W. Zachary. Journal of
Anthropological Research, Vol. 33, No. 4 (Winter, 1977), pp. 452-473
xix
CCICADA module: Who Is Really in Charge. M. Bautista, D. Beers, M. Goodloe. http://ccicada.org/wp-
content/uploads/2015/06/WhoIsReallyInChargeModule_Bautista_Beers_Goodloe_final-October-29-2015.pdf
(2015)
xx
https://networkdata.ics.uci.edu/data.php?id=105
xxi
http://ifisc.uib-csic.es/~jramasco/fig/zach_layout3.jpg
xxii
http://stanford.edu/class/archive/cs/cs106b/cs106b.1156/images/graph-figure-2.png
xxiii
https://aleadeum.files.wordpress.com/2016/07/lesmis.png?w=700
This network is based on the data set from D. E. Knuth. (1993). The Stanford GraphBase: A Platform for
Combinatorial Computing, Addison-Wesley, Reading, MA
xxiv
Section 2.2, Network Science. 9, Network Science. Albert-László Barabási. Cambridge University Press.(2016).
A-L Barábasi speaks of the "wiring diagram" of a network in his book beginning in Section 2.2.
xxv
The application of social network theory to animal behaviour, Amelia Coleing, March 3, 2009
xxvi
http://biohorizons.oxfordjournals.org/content/2/1/32.full
xxvii
http://whatis.techtarget.com/definition/six-degrees-of-separation
xxviii
Slide 8. Betweenness Measures and Graph Partitioning.
https://www.ismll.uni-hildesheim.de/lehre/cmie-11w/script/lecture5.pdf
xxix
Ibid
xxx
https://en.wikipedia.org/wiki/Girvan%E2%80%93Newman_algorithm
xxxi
Slide 9. Betweenness Measures and Graph Partitioning.
https://www.ismll.uni-hildesheim.de/lehre/cmie-11w/script/lecture5.pdf
xxxii
Slide 10. Betweenness Measures and Graph Partitioning.
https://www.ismll.uni-hildesheim.de/lehre/cmie-11w/script/lecture5.pdf
xxxiii
This graph is from slide 33 of a presentation, Community Detection and Graph-based Clustering. Adapted from
Chapter 3 of the book, Community Detection and Mining in Social Media, by Lei Tang and Huan Liu. Slides
prepared by Qiang Yang. http://images.slideplayer.com/8/2309634/slides/slide_33.jpg
52

xxxiv
Ibid
xxxv
https://en.wikipedia.org/wiki/Community
xxxvi
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3088589/
xxxvii
http://www.cfinder.org/
xxxviii
Network Science. Albert-László Barabási. Cambridge University Press.(2016)
xxxix
http://mathworld.wolfram.com/CompleteGraph.html
xl
https://en.wikipedia.org/wiki/Dense_graph
xli
http://www.faculty.ucr.edu/~hanneman/nettext/C10_Centrality.html
xlii
p. 357. Networks: An Introduction. M .E. J. Newman. Oxford University Press (2010)
xliii
Characterizing the Community Structure of Complex Networks. Andrea Lancichinetti, Mikko Kivelä, Jari
Saramäki, Santo Fortunato. PLOS One.( August 12, 2010)
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0011976
xliv
Community Detection in Graphs. Santo Fortunato. (2010) https://arxiv.org/pdf/0906.0612.pdf
xlv
Ibid
xlvi
Ibid
xlvii
Figure 1: Example network showing community structure. "Communities, modules and large-scale structure in
networks." M. E. J. Newman. Nature Physics 8, 25–31. (2012).
http://www.nature.com/nphys/journal/v8/n1/full/nphys2162.html
xlviii
Ibid
xlix
https://en.wikipedia.org/wiki/Modularity_(networks)
l
Ibid
li
p.339, Ibid
lii
p. 339, Ibid
liii
Ibid
liv
Aaron Clauset. Lecture 5. Network Analysis.
http://tuvalu.santafe.edu/~aaronc/courses/5352/csci5352_2016_L5.pdf
lv
Ibid
lvi
A.-L. Barabási, Chapter 9, Network Science. Cambridge University Press. (2016)
lvii
Ibid
lviii
https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-
clustering/
lix
Ibid

10. References

[1] M. De Domenico, A. Lima, P. Mougel and M. Musolesi. The Anatomy of a Scientific


Rumor. (Nature Open Access) Scientific Reports 3, 2980 (2013).
[2] Santo Fortunato. Community detection in graphs. Physics reports 486 (3), 75-174.
2010/2/28. North-Holland.
[https://arxiv.org/pdf/0906.0612.pdf]
[3] Community Detection for Social Computing. Powerpoint presentation. Lei Tang. Yahoo!
Labs. 2011/09/16. UC Berkeley.
[http://leitang.net/presentation/community_detection_2011_berkeley.pdf]

[4] Netlytic software 2016/17/6. [https://netlytic.org/]


53

[5] R. Zafarani and H. Liu, (2009). Social Computing Data Repository at ASU
[http://socialcomputing.asu.edu]. Tempe, AZ: Arizona State University, School of Computing,
Informatics and Decision Systems Engineering.
[6] Albert-László Barabási. Network Science. Cambridge University Press.(2016)

[7] Mark Newman. Networks: An Introduction. Oxford University Press. (2010)

You might also like