Unsupervised Learning
Dr. Dinh Dong Luong
Contents
Introduction
Distance and Similarity Measurement
K-Means Clustering
Hierarchical Clustering
Density-based Clustering
Supervised learning vs. unsupervised learning
Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute (labelled data).
These patterns are then used to predict the values of the target attribute in future data instances.
Unsupervised learning: the data have no target attribute (unlabelled data).
We want to explore the data to find some intrinsic structure in them.
Introduction
Data clustering concerns how to group a set of objects based on the similarity of their attributes and/or their proximity in the vector space.
A good clustering method produces high-quality clusters with
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
Clustering is often called an unsupervised learning task.
Clustering is often considered synonymous with unsupervised learning. In fact, association rule mining is also unsupervised.
An illustration
The data set has three natural groups of data points, i.e., 3 natural clusters.
Stages in clustering
Aspects of clustering
A clustering algorithm
Partitional clustering
Hierarchical clustering
…
A distance (similarity, or dissimilarity) function
Clustering quality
Inter-cluster distance maximized
Intra-cluster distance minimized
The quality of a clustering result depends on the algorithm, the distance function, and the application.
Similarity and Distance Measurement
Key to clustering: "similarity" and "dissimilarity" are also commonly used terms.
There are numerous distance functions for different types of data:
Numeric data
Nominal data
Distance functions for numeric attributes
Pearson Correlation
Trend Similarity (Pearson Correlation)
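The original slides show the Pearson correlation formula and a trend-similarity example as figures. As a minimal sketch (the function name `pearson_similarity` and the toy vectors below are illustrative, not from the slides), Pearson correlation divides the covariance of two numeric vectors by the product of their standard deviations; values near +1 mean the two series follow the same trend, even if their scales differ.

```python
import numpy as np

def pearson_similarity(x, y):
    """Pearson correlation of two equal-length numeric vectors.

    Returns a value in [-1, 1]; values near +1 mean the two
    series rise and fall together (similar trend).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Two series with the same trend but different scales
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(pearson_similarity(a, b))  # 1.0 (identical trend)
```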
Similarity Measurements
(Worked examples of similarity measurements are shown as figures in the original slides.)
Euclidean distance and Manhattan distance
Euclidean distance:
$dist(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2}$
Manhattan distance:
$dist(\mathbf{x}_i, \mathbf{x}_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ir} - x_{jr}|$
Weighted Euclidean distance:
$dist(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \dots + w_r (x_{ir} - x_{jr})^2}$
Squared distance and Chebychev distance
Squared Euclidean distance: used to place progressively greater weight on data points that are further apart.
$dist(\mathbf{x}_i, \mathbf{x}_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2$
Chebychev distance: used when one wants to define two data points as "different" if they differ on any one of the attributes.
$dist(\mathbf{x}_i, \mathbf{x}_j) = \max(|x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, \dots, |x_{ir} - x_{jr}|)$
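A minimal NumPy sketch of these distance functions (the function names and the example weights are illustrative, not from the slides):

```python
import numpy as np

def euclidean(xi, xj):
    return float(np.sqrt(np.sum((np.asarray(xi) - np.asarray(xj)) ** 2)))

def manhattan(xi, xj):
    return float(np.sum(np.abs(np.asarray(xi) - np.asarray(xj))))

def weighted_euclidean(xi, xj, w):
    d = np.asarray(xi) - np.asarray(xj)
    return float(np.sqrt(np.sum(np.asarray(w) * d ** 2)))

def squared_euclidean(xi, xj):
    return float(np.sum((np.asarray(xi) - np.asarray(xj)) ** 2))

def chebychev(xi, xj):
    return float(np.max(np.abs(np.asarray(xi) - np.asarray(xj))))

xi, xj = [0.1, 20.0], [0.9, 720.0]
print(euclidean(xi, xj))                       # ~700.0005
print(manhattan(xi, xj))                       # 700.8
print(weighted_euclidean(xi, xj, [0.5, 0.5]))  # down-weights both attributes
print(squared_euclidean(xi, xj))               # 490000.64
print(chebychev(xi, xj))                       # 700.0
```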
Distance function for text documents
A text document consists of a sequence of sentences, and each sentence consists of a sequence of words.
To simplify, a document is usually treated as a "bag" of words in document clustering.
Sequence and position of words are ignored.
A document is represented with a vector just like a normal data point.
It is common to use similarity to compare two documents rather than distance.
The most commonly used similarity function is the cosine similarity. We will study this later.
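As a small sketch of the bag-of-words idea (assuming simple whitespace tokenization and raw term counts; real systems typically use TF-IDF weights), cosine similarity is the dot product of the two document vectors divided by the product of their norms:

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity of two documents under a bag-of-words model."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("data clustering groups similar objects",
                        "clustering groups similar data points"))
```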
Data standardization
In the Euclidean space, standardization of attributes is recommended so that all attributes can have equal impact on the computation of distances.
Consider the following pair of data points
xi: (0.1, 20) and xj: (0.9, 720).
$dist(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(0.9 - 0.1)^2 + (720 - 20)^2} = 700.000457$
The distance is almost completely dominated by (720 - 20) = 700.
Standardize attributes: force the attributes to have a common value range.
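A minimal sketch of min-max standardization to [0, 1], one common choice (z-score standardization is another; the function name is illustrative):

```python
import numpy as np

def min_max_standardize(X):
    """Rescale each attribute (column) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ranges = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    return (X - mins) / ranges

X = np.array([[0.1, 20.0],
              [0.9, 720.0],
              [0.5, 370.0]])
Xs = min_max_standardize(X)
# After standardization both attributes lie in [0, 1],
# so neither one dominates the Euclidean distance.
print(Xs)
```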
How to choose a clustering algorithm
Choosing the "best" algorithm is a challenge.
Every algorithm has limitations and works well with certain data distributions.
It is very hard, if not impossible, to know what distribution the application data follow. The data may not fully follow any "ideal" structure or distribution required by the algorithms.
One also needs to decide how to standardize the data, how to choose a suitable distance function, and how to select other parameter values.
K-means clustering
K-means is a partitional clustering algorithm.
Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space $X \subseteq \mathbb{R}^r$, and r is the number of attributes (dimensions) in the data.
The k-means algorithm partitions the given data into k clusters.
Each cluster has a cluster center, called the centroid.
k is specified by the user.
K-means algorithm
Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
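A compact sketch of these four steps in NumPy (the function name, the iteration cap, and the random-seed handling are illustrative choices, not part of the slides; library implementations such as scikit-learn's KMeans add refinements like k-means++ seeding):

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    """Plain k-means: random seeds, assign, re-compute, repeat until stable."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    # 1) Randomly choose k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) Assign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Re-compute centroids from the current cluster memberships
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4) Stop when centroids no longer move (convergence criterion)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.default_rng(0).normal(loc=c, size=(20, 2)) for c in (0, 5, 10)])
labels, centroids = kmeans(X, k=3, rng=0)
```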
An example
An example (cont …)
Weaknesses of k-means: Problems with outliers
Weaknesses of k-means: To deal with outliers
One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points.
To be safe, we may want to monitor these possible outliers over a few iterations before deciding to remove them.
Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
Assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.
Weaknesses of k-means (cont …)
The algorithm is sensitive to initial seeds.
Weaknesses of k-means (cont …)
If we use different seeds: good results.
There are some methods to help choose good seeds.
Use frequent values to represent clusters
This method is mainly for clustering of categorical data (e.g., k-modes clustering).
It is the main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster.
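As a hedged illustration of this idea (the function name and the toy records are made up for the example), a cluster of categorical records can be represented by the most frequent value, i.e., the mode, of each attribute:

```python
from collections import Counter

def cluster_modes(records):
    """Represent a cluster of categorical records by the per-attribute mode."""
    n_attrs = len(records[0])
    return tuple(
        Counter(rec[i] for rec in records).most_common(1)[0][0]
        for i in range(n_attrs)
    )

cluster = [("red", "suv"), ("red", "sedan"), ("blue", "suv"), ("red", "suv")]
print(cluster_modes(cluster))  # ('red', 'suv') -- the cluster's representative
```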
Clusters of arbitrary shapes
Hyper-elliptical and hyper-spherical clusters are usually easy to represent, using their centroids together with spreads.
Irregularly shaped clusters are hard to represent. They may not be useful in some applications.
Using centroids is not suitable in general (upper figure).
K-means clusters may be more useful (lower figure), e.g., for making 2 sizes of T-shirts.
Hierarchical Clustering
Produce a nested sequence of clusters, a tree.
Types of hierarchical clustering
Agglomerative (bottom-up) clustering: builds the tree from the bottom level, and
merges the most similar (or nearest) pair of clusters
stops when all the data points are merged into a single cluster (i.e., the root cluster).
Divisive (top-down) clustering: starts with all data points in one cluster, the root.
Splits the root into a set of child clusters. Each child cluster is recursively divided further.
Stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.
Agglomerative clustering
It is more popular than divisive methods.
At the beginning, each data point forms a cluster (also called a node).
Merge the nodes/clusters that have the least distance between them.
Keep merging.
Eventually all nodes belong to one cluster.
Agglomerative clustering algorithm
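A minimal sketch of the merge loop under the single-link distance (the helper names are illustrative; the slides' own algorithm figure is not reproduced here, and scipy.cluster.hierarchy.linkage is the usual library route — see the linkage-methods example later in this section):

```python
import numpy as np

def single_link(c1, c2, X):
    """Single-link distance: closest pair of points across the two clusters."""
    return min(np.linalg.norm(X[i] - X[j]) for i in c1 for j in c2)

def agglomerative(X, target_k=1):
    """Start with singleton clusters and repeatedly merge the closest pair."""
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]        # each point is its own node
    while len(clusters) > target_k:
        # find the pair of clusters with the least distance
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]], X),
        )
        clusters[a] = clusters[a] + clusters[b]     # merge cluster b into a
        del clusters[b]
    return clusters

X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
print(agglomerative(X, target_k=2))  # [[0, 1, 2, 3], [4]]
```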
An example: working of the algorithm
Measuring the distance of two clusters
There are a few ways to measure the distance between two clusters.
They result in different variations of the algorithm:
Single link
Complete link
Average link
Centroids
…
Single link method
The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster.
It can find arbitrarily shaped clusters, but
it may cause the undesirable "chain effect" due to noisy points.
(Figure: two natural clusters are split into two.)
Complete link method
The distance between two clusters is the distance between the two furthest data points in the two clusters.
It is sensitive to outliers because they are far away.
Average link and centroid methods
Average link: a compromise between
the sensitivity of complete-link clustering to outliers, and
the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects.
In this method, the distance between two clusters is the average of all pairwise distances between the data points in the two clusters.
Average link and centroid methods
Centroid method: the distance between two clusters is the distance between their centroids.
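A hedged sketch of how these linkage choices are typically exercised in practice with SciPy (the toy data and the cut into two clusters are illustrative; `scipy.cluster.hierarchy.linkage` supports the 'single', 'complete', 'average', and 'centroid' methods described above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]], dtype=float)

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # hierarchical merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)
```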
Density-based Approaches
Why density-based clustering methods?
They discover clusters of arbitrary shape.
Clusters are dense regions of objects separated by regions of low density.
DBSCAN – the first density-based clustering algorithm
OPTICS – density-based cluster ordering
DENCLUE – a general density-based description of clusters and clustering
Density-Based Clustering
Basic idea: clusters are dense regions in the data space, separated by regions of lower object density.
Why Density-Based Clustering?
(Figure: results of a k-medoid algorithm for k = 4.)
Density Based Clustering: Basic Concept
Intuition for the formalization of the basic idea:
For any point in a cluster, the local point density around that point has to exceed some threshold.
The set of points from one cluster is spatially connected.
The local point density at a point p is defined by two parameters:
ε – radius of the neighborhood of point p:
$N_\varepsilon(p) := \{ q \in D \mid dist(p, q) \le \varepsilon \}$
MinPts – minimum number of points in the given neighborhood $N_\varepsilon(p)$
ε-Neighborhood
ε-Neighborhood – objects within a radius of ε from an object:
$N_\varepsilon(p) := \{ q \mid d(p, q) \le \varepsilon \}$
"High density" – the ε-neighborhood of an object contains at least MinPts objects.
(Figure: the ε-neighborhoods of p and q; for MinPts = 4, the density of p is "high" and the density of q is "low".)
Core, Border & Outlier
Given ε and MinPts, we categorize the objects into three exclusive groups:
A point is a core point if it has more than a specified number of points (MinPts) within its ε-neighborhood. These are points in the interior of a cluster.
A border point has fewer than MinPts points within its ε-neighborhood, but is in the neighborhood of a core point.
A noise point (outlier) is any point that is neither a core point nor a border point.
(Figure: core, border and outlier points for ε = 1 unit, MinPts = 5.)
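A minimal sketch that labels each point as core, border, or noise (the function name is illustrative; following the common convention, a point is core if its ε-neighborhood, including the point itself, contains at least MinPts points — the slides' "more than MinPts" wording differs slightly, so treat the exact threshold as an assumption):

```python
import numpy as np

def label_points(X, eps, min_pts):
    """Label each point in X as 'core', 'border', or 'noise'."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # eps-neighborhoods (each point lies inside its own neighborhood)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighborhoods = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighborhoods])

    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighborhoods[i]):
            labels.append("border")   # not core, but in a core point's neighborhood
        else:
            labels.append("noise")
    return labels

X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [2, 0.5], [5, 5]]
print(label_points(X, eps=1.2, min_pts=4))
# ['core', 'core', 'core', 'core', 'core', 'border', 'noise']
```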
DBSCAN Algorithm
Input: the data set D
Parameters: ε, MinPts
for each object p in D do
    if p is a core object and not yet processed then
        C = retrieve all objects density-reachable from p
        mark all objects in C as processed
        report C as a cluster
    else mark p as outlier
    end if
end for
DBSCAN: The Algorithm
Arbitrarily select a point p.
Retrieve all points density-reachable from p with respect to ε and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
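A short, hedged usage sketch (assuming scikit-learn is available; its DBSCAN estimator takes `eps` and `min_samples` and labels noise points with -1 — the toy data below are made up for the example):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point that should come out as noise
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(5.0, 0.3, size=(30, 2)),
               [[10.0, 10.0]]])

labels = DBSCAN(eps=1.0, min_samples=4).fit_predict(X)
print(set(labels))  # e.g. two cluster labels plus -1 for noise
```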
DBSCAN Algorithm: Example
Parameters: ε = 2 cm, MinPts = 3
for each o ∈ D do
    if o is not yet classified then
        if o is a core object then
            collect all objects density-reachable from o
            and assign them to a new cluster
        else
            assign o to NOISE
Example
(Figure: original points and point types — core, border and outliers — for ε = 10, MinPts = 4.)
When DBSCAN Works Well
(Figure: original points and the resulting clusters.)
• Resistant to noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
(Figure: original points and the results for MinPts = 4 with Eps = 9.92 and Eps = 9.75.)
• Cannot handle varying densities
• Sensitive to parameters
DBSCAN: Sensitive to Parameters
Density Based Clustering: Discussion
Advantages
Clusters can have arbitrary shape and size
Number of clusters is determined automatically
Can separate clusters from surrounding noise
Can be supported by spatial index structures
Disadvantages
Input parameters may be difficult to determine
In some situations very sensitive to input parameter settings
DENCLUE: Technical Essence
Influence functions describe the influence of a data point y on a point x (σ is a user-given constant):
Square: $f^y_{Square}(x) = 0$ if $d(x, y) > \sigma$, and $1$ otherwise
Gaussian: $f^y_{Gaussian}(x) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$
Density Function
The density at a point x is defined as the sum of the influence functions of all data points:
$f^D_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
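A minimal sketch of these two formulas (the function names and the example σ are illustrative; the full DENCLUE algorithm additionally performs hill-climbing to density attractors, which is not shown here):

```python
import numpy as np

def gaussian_influence(x, y, sigma):
    """Gaussian influence of data point y on point x."""
    d2 = np.sum((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2)
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

def gaussian_density(x, data, sigma):
    """DENCLUE-style density: sum of the Gaussian influences of all data points."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)

data = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0]]
print(gaussian_density([0.1, 0.1], data, sigma=0.5))  # high: inside the dense region
print(gaussian_density([3.0, 3.0], data, sigma=0.5))  # near zero: sparse region
```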