Data Clustering -- Introduction
Ref: Han and Kamber, Data Mining: Concepts and Techniques.
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Evaluation of Clustering
Summary
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis
Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Clustering: Application Examples
Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth
observation database
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Climate: understanding Earth's climate by finding patterns in
atmospheric and ocean data
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
Separation of clusters
Hard (e.g., one customer belongs to only one region) vs. Soft
(e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean) vs. connectivity-based (e.g.,
density)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
Major Clustering Approaches
Partitioning approach:
Construct various partitions and then evaluate them by some
criterion
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
using some criterion
Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DENCLUE
Grid-based approach:
Based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Partitioning Algorithms: Basic Concept
Partitioning method: partition a database D of n objects into a set
of k clusters such that the sum of squared distances
E = sum_{i=1..k} sum_{p in C_i} dist(p, c_i)^2
is minimized, where c_i is the centroid or medoid of cluster C_i
Global optimum: requires exhaustively enumerating all partitions
(intractable)
Heuristic methods: k-means and k-medoids algorithms
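As a concrete illustration of this criterion, a minimal sketch (Python is assumed here; the function and variable names are illustrative, not from the text):

```python
from math import dist  # Euclidean distance, Python 3.8+

def sse(clusters, centers):
    """Sum of squared distances of every object to its cluster's center c_i."""
    return sum(dist(p, c) ** 2
               for members, c in zip(clusters, centers)
               for p in members)

# Two tiny 2-D clusters with their centroids.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
centers = [(1.0, 0.0), (11.0, 0.0)]
```

Each point here sits at distance 1 from its centroid, so E = 4.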
The K-Means Clustering Method
Given k, the k-means algorithm proceeds in four steps:
1. Start with an arbitrary partition of the objects into k blocks
2. Compute the centroid (mean) of each block
3. Reassign each object to the block with the nearest centroid
4. Repeat from step 2; stop when no reassignment occurs
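The steps above can be sketched as a minimal Lloyd-style implementation (Python assumed; seeding from the first k points is an illustrative simplification, since real implementations choose initial centroids more carefully):

```python
from math import dist
from statistics import mean

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: seed centroids with the first k points, then
    alternate assignment and mean update until assignments stabilize."""
    centroids = [points[i] for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda j: dist(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:
            break  # no object changed cluster: converged
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(mean(c) for c in zip(*members))
    return centroids, assignment
```

On four collinear points in two natural pairs, this converges in a few iterations to the two pair midpoints.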
What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers!
Since an object with an extremely large value may substantially
distort the distribution of the data
K-Medoids: Instead of taking the mean value of the objects in a
cluster as a reference point, a medoid can be used: the most
centrally located object in a cluster
Sometimes the centroid does not make sense, since it need not be
part of the data. E.g., if the data consist of binary vectors, the
centroid is generally not a binary vector.
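The contrast between the two reference points can be sketched as follows (Python assumed; function names are illustrative):

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean: one O(n) pass, but need not be a data object."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def medoid(points):
    """Most centrally located data object: minimizes the total distance to
    all other objects. The pairwise distance sums make this O(n^2)."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

# Binary vectors: the medoid is always one of them, the centroid is not.
data = [(0, 0), (0, 1), (1, 0)]
```

For this data the medoid is (0, 0), while the centroid (1/3, 1/3) is not a binary vector.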
But,
Finding the medoid is costly: the centroid can be found in linear
time, whereas finding the medoid can take quadratic time.
Partitioning Around Medoids (PAM) is a costly method, and
approximations of it exist (e.g., CLARANS).
Finding the centroid itself might be difficult in some situations
(kernel k-means).
Hierarchical Clustering
Uses the distance matrix as the clustering criterion. This method
does not require the number of clusters k as an input, but it needs
a termination condition.
(Figure: AGNES, agglomerative, merges singletons a, b, c, d, e
bottom-up into ab, de, cde, and finally abcde; DIANA, divisive,
splits in the reverse, top-down order.)
Hierarchical Agglomerative Clustering:
Linkage Methods
The single linkage method is based on minimum
distance, or the nearest neighbor rule.
The complete linkage method is based on the
maximum distance or the furthest neighbor
approach.
In the average linkage method, the distance between two clusters is
defined as the average of the distances between all pairs of objects,
one object from each cluster.
(Figure: single linkage uses the minimum distance, complete linkage
the maximum distance, and average linkage the average distance
between Cluster 1 and Cluster 2; a dendrogram records the resulting
merge sequence.)
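The three linkage rules can be sketched directly from their definitions (Python assumed; function names are illustrative):

```python
from math import dist
from itertools import product

def single_link(A, B):
    """Nearest-neighbor rule: minimum pairwise distance between clusters."""
    return min(dist(a, b) for a, b in product(A, B))

def complete_link(A, B):
    """Furthest-neighbor rule: maximum pairwise distance."""
    return max(dist(a, b) for a, b in product(A, B))

def average_link(A, B):
    """Mean over all |A| * |B| pairwise distances."""
    return sum(dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

cluster1 = [(0.0, 0.0), (1.0, 0.0)]
cluster2 = [(3.0, 0.0), (5.0, 0.0)]
```

On these two clusters the pairwise distances are 3, 5, 2, 4, so the three linkages give 2, 5, and 3.5 respectively.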
Determine the Number of Clusters
Empirical method
# of clusters: k ≈ √(n/2) for a dataset of n points, e.g., n = 200
gives k = 10
Other methods:
Elbow method
Density based methods
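The empirical rule above, as a one-line sketch (Python assumed; the function name is illustrative):

```python
from math import sqrt

def rule_of_thumb_k(n):
    """Empirical starting point for the number of clusters: k ~ sqrt(n/2)."""
    return round(sqrt(n / 2))
```

For n = 200 this reproduces the slide's example, k = 10.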
Measuring Clustering Quality
3 kinds of measures: External, internal and relative
External: supervised, employ criteria not inherent to the dataset
Compare a clustering against prior or expert-specified
knowledge using certain clustering quality measure
Internal: unsupervised, criteria derived from data itself
Evaluate the goodness of a clustering by considering how
well the clusters are separated, and how compact the
clusters are, e.g., Silhouette coefficient
Relative: directly compare different clusterings, usually those
obtained via different parameter settings for the same algorithm
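The Silhouette coefficient named above as an internal measure can be sketched for a single object (Python assumed; this computes the per-object value s = (b - a) / max(a, b), with names illustrative):

```python
from math import dist

def silhouette(p, own_cluster, other_clusters):
    """Silhouette of one object p: a is the mean distance to the other
    members of p's cluster (compactness), b is the mean distance to the
    nearest other cluster (separation); s ranges over [-1, 1]."""
    others = [q for q in own_cluster if q != p]
    a = sum(dist(p, q) for q in others) / len(others)
    b = min(sum(dist(p, q) for q in C) / len(C) for C in other_clusters)
    return (b - a) / max(a, b)
```

Values near 1 indicate a compact, well-separated assignment; values near 0 or below suggest the object sits between clusters.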
Summary
Cluster analysis groups objects based on their similarity and has
wide applications
Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
BIRCH and CHAMELEON are interesting hierarchical clustering algorithms,
and there are also probabilistic hierarchical clustering algorithms
DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a
subspace clustering algorithm
Quality of clustering results can be evaluated in various ways
Important point
Are we solving a solvable problem?
An impossibility theorem on clustering (Kleinberg): no clustering
function can simultaneously satisfy scale-invariance, richness, and
consistency
Possibility arguments (Ackerman): recasting the axioms, e.g., as
requirements on clustering-quality measures rather than on clustering
functions, makes them satisfiable