Data Clustering -- Introduction
Ref: Han and Kamber, Data Mining: Concepts and Techniques.
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Evaluation of Clustering
Summary
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis
Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Clustering: Application Examples
Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth
observation database
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Climate: understanding Earth's climate by finding patterns in
atmospheric and ocean data
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
Separation of clusters
Hard (e.g., one customer belongs to only one region) vs. Soft
(e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean) vs. connectivity-based (e.g.,
density)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
Major Clustering Approaches
Partitioning approach:
Construct various partitions and then evaluate them by some
criterion
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
using some criterion
Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DENCLUE
Grid-based approach:
Based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Partitioning Algorithms: Basic Concept
Partitioning method: partition a database D of n objects into a set
of k clusters such that the sum of squared distances
E = sum_{i=1..k} sum_{p in C_i} dist(p, c_i)^2
is minimized, where c_i is the centroid or medoid of cluster C_i
Global optimum: requires exhaustively enumerating all partitions
(intractable)
Heuristic methods: k-means and k-medoids algorithms
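As a concrete illustration of this criterion, a minimal sketch (Python is assumed here; the function and variable names are illustrative, not from the text):

```python
from math import dist  # Euclidean distance, Python 3.8+

def sse(clusters, centers):
    """Sum of squared distances of every object to its cluster's center c_i."""
    return sum(dist(p, c) ** 2
               for members, c in zip(clusters, centers)
               for p in members)

# Two tiny 2-D clusters with their centroids.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
centers = [(1.0, 0.0), (11.0, 0.0)]
```

Each point here sits at distance 1 from its centroid, so E = 4.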
The K-Means Clustering Method
Given k, the k-means algorithm proceeds in four steps:
1. Start with an arbitrary partition of the objects into k blocks
2. Compute the centroid (mean) of each block
3. Reassign each object to the block with the nearest centroid
4. Repeat from step 2; stop when no reassignment occurs
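The steps above can be sketched as a minimal Lloyd-style implementation (Python assumed; seeding from the first k points is an illustrative simplification, since real implementations choose initial centroids more carefully):

```python
from math import dist
from statistics import mean

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: seed centroids with the first k points, then
    alternate assignment and mean update until assignments stabilize."""
    centroids = [points[i] for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda j: dist(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:
            break  # no object changed cluster: converged
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(mean(c) for c in zip(*members))
    return centroids, assignment
```

On four collinear points in two natural pairs, this converges in a few iterations to the two pair midpoints.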
What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers!
Since an object with an extremely large value may substantially
distort the distribution of the data
K-Medoids: Instead of taking the mean value of the objects in a
cluster as a reference point, a medoid can be used: the most
centrally located object in a cluster
Sometimes the centroid does not make sense, since it need not be
part of the data. E.g., if the data consist of binary vectors, the
centroid is generally not a binary vector.
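The contrast between the two reference points can be sketched as follows (Python assumed; function names are illustrative):

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean: one O(n) pass, but need not be a data object."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def medoid(points):
    """Most centrally located data object: minimizes the total distance to
    all other objects. The pairwise distance sums make this O(n^2)."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

# Binary vectors: the medoid is always one of them, the centroid is not.
data = [(0, 0), (0, 1), (1, 0)]
```

For this data the medoid is (0, 0), while the centroid (1/3, 1/3) is not a binary vector.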
But,
Finding the medoid is costly: the centroid can be found in linear
time, whereas finding the medoid can take quadratic time.
Partitioning Around Medoids (PAM) is a costly method, and
approximations of it exist (e.g., CLARANS).
Finding the centroid itself might be difficult in some situations
(kernel k-means).
Hierarchical Clustering
Uses the distance matrix as the clustering criterion. This method
does not require the number of clusters k as an input, but it needs
a termination condition.
(Figure: AGNES, agglomerative, merges singletons a, b, c, d, e
bottom-up into ab, de, cde, and finally abcde; DIANA, divisive,
splits in the reverse, top-down order.)
Hierarchical Agglomerative Clustering:
Linkage Methods
The single linkage method is based on minimum
distance, or the nearest neighbor rule.
The complete linkage method is based on the
maximum distance or the furthest neighbor
approach.
In the average linkage method, the distance between two clusters is
defined as the average of the distances between all pairs of objects,
one object from each cluster.
(Figure: single linkage uses the minimum distance, complete linkage
the maximum distance, and average linkage the average distance
between Cluster 1 and Cluster 2; a dendrogram records the resulting
merge sequence.)
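The three linkage rules can be sketched directly from their definitions (Python assumed; function names are illustrative):

```python
from math import dist
from itertools import product

def single_link(A, B):
    """Nearest-neighbor rule: minimum pairwise distance between clusters."""
    return min(dist(a, b) for a, b in product(A, B))

def complete_link(A, B):
    """Furthest-neighbor rule: maximum pairwise distance."""
    return max(dist(a, b) for a, b in product(A, B))

def average_link(A, B):
    """Mean over all |A| * |B| pairwise distances."""
    return sum(dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

cluster1 = [(0.0, 0.0), (1.0, 0.0)]
cluster2 = [(3.0, 0.0), (5.0, 0.0)]
```

On these two clusters the pairwise distances are 3, 5, 2, 4, so the three linkages give 2, 5, and 3.5 respectively.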
Determine the Number of Clusters
Empirical method
# of clusters: k ≈ √(n/2) for a dataset of n points, e.g., n = 200
gives k = 10
Other methods:
Elbow method
Density based methods
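The empirical rule above, as a one-line sketch (Python assumed; the function name is illustrative):

```python
from math import sqrt

def rule_of_thumb_k(n):
    """Empirical starting point for the number of clusters: k ~ sqrt(n/2)."""
    return round(sqrt(n / 2))
```

For n = 200 this reproduces the slide's example, k = 10.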
Measuring Clustering Quality
3 kinds of measures: External, internal and relative
External: supervised, employ criteria not inherent to the dataset
Compare a clustering against prior or expert-specified
knowledge using certain clustering quality measure
Internal: unsupervised, criteria derived from data itself
Evaluate the goodness of a clustering by considering how
well the clusters are separated, and how compact the
clusters are, e.g., Silhouette coefficient
Relative: directly compare different clusterings, usually those
obtained via different parameter settings for the same algorithm
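The Silhouette coefficient named above as an internal measure can be sketched for a single object (Python assumed; this computes the per-object value s = (b - a) / max(a, b), with names illustrative):

```python
from math import dist

def silhouette(p, own_cluster, other_clusters):
    """Silhouette of one object p: a is the mean distance to the other
    members of p's cluster (compactness), b is the mean distance to the
    nearest other cluster (separation); s ranges over [-1, 1]."""
    others = [q for q in own_cluster if q != p]
    a = sum(dist(p, q) for q in others) / len(others)
    b = min(sum(dist(p, q) for q in C) / len(C) for C in other_clusters)
    return (b - a) / max(a, b)
```

Values near 1 indicate a compact, well-separated assignment; values near 0 or below suggest the object sits between clusters.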
Summary
Cluster analysis groups objects based on their similarity and has
wide applications
Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
BIRCH and CHAMELEON are interesting hierarchical clustering algorithms,
and there are also probabilistic hierarchical clustering algorithms
DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a
subspace clustering algorithm
Quality of clustering results can be evaluated in various ways
Important point
Are we solving a solvable problem?
An impossibility theorem on clustering (Kleinberg): no clustering
function can simultaneously satisfy scale-invariance, richness, and
consistency
Possibility arguments (Ackerman): recasting the axioms, e.g., as
requirements on clustering-quality measures rather than on clustering
functions, makes them satisfiable