Chapter 4

Clustering

Course Outcomes
ITC602.4: Implement appropriate data mining methods such as classification, clustering, or frequent pattern mining on large data sets.
Chapter 4. Clustering
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods - K-Means, K-Medoids
5. Hierarchical Methods - Agglomerative, Divisive, BIRCH
6. Density-Based Methods - DBSCAN
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities according to the characteristics found in the data and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
Clustering: Rich Applications and Multidisciplinary Efforts
 Pattern Recognition
 Spatial Data Analysis
 Create thematic maps in GIS by clustering feature spaces
 Detect spatial clusters, or use clustering for other spatial mining tasks
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
 Land use: Identification of areas of similar land use in an earth observation database
 Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
 City planning: Identifying groups of houses according to their house type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Quality: What Is Good Clustering?
 A good clustering method will produce high-quality clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the similarity measure used by the method and its implementation
 The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
 Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically a metric d(i, j)
 There is a separate "quality" function that measures the "goodness" of a cluster
 The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
 Weights should be associated with different variables based on applications and data semantics
 It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective
Requirements of Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to determine input parameters
 Ability to deal with noise and outliers
 Insensitivity to the order of input records
 Ability to handle high dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Data Structures
 Data matrix (two modes):

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

 Dissimilarity matrix (one mode):

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
Types of Data in Clustering Analysis
 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types
Interval-valued Variables
 Standardize data
 Calculate the mean absolute deviation:

$$ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) $$

where

$$ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) $$

 Calculate the standardized measurement (z-score):

$$ z_{if} = \frac{x_{if} - m_f}{s_f} $$

 Using the mean absolute deviation is more robust than using the standard deviation
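A minimal Python sketch of this standardization (the function name is illustrative; toy data):

```python
def standardize(values):
    """Standardize one variable using the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                          # mean m_f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in values]       # z-scores z_if

print(standardize([2.0, 4.0, 4.0, 4.0, 6.0]))      # [-2.5, 0.0, 0.0, 0.0, 2.5]
```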
Similarity and Dissimilarity Between Objects
 Distances are normally used to measure the similarity or dissimilarity between two data objects
 Some popular ones include the Minkowski distance:

$$ d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q} $$

where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects, and q is a positive integer
 If q = 1, d is the Manhattan distance:

$$ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| $$
Similarity and Dissimilarity Between Objects (Cont.)
 If q = 2, d is the Euclidean distance:

$$ d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} $$

 Properties
 d(i,j) ≥ 0
 d(i,i) = 0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)
 Also, one can use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures
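A small sketch of the Minkowski distance above; q = 1 and q = 2 reproduce the Manhattan and Euclidean cases (toy points):

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i, j = (1.0, 2.0), (4.0, 6.0)
print(minkowski(i, j, 1))   # Manhattan: 7.0
print(minkowski(i, j, 2))   # Euclidean: 5.0
```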
Binary Variables
 A contingency table for binary data:

              Object j
                1     0    sum
Object i   1    a     b    a+b
           0    c     d    c+d
          sum  a+c   b+d    p

 Distance measure for symmetric binary variables:

$$ d(i,j) = \frac{b + c}{a + b + c + d} $$

 Distance measure for asymmetric binary variables:

$$ d(i,j) = \frac{b + c}{a + b + c} $$

 Jaccard coefficient (similarity measure for asymmetric binary variables):

$$ sim_{Jaccard}(i,j) = \frac{a}{a + b + c} $$
Dissimilarity Between Binary Variables
 Example:

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

 gender is a symmetric attribute
 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0

$$ d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33 $$
$$ d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67 $$
$$ d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75 $$
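A sketch that recomputes d(jack, mary) from the contingency counts; the encoding (Y/P to 1, N to 0) and the vectors follow the example above, restricted to the asymmetric attributes:

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Dissimilarity between two 0/1 vectors via the contingency counts."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    if symmetric:
        return (b + c) / (a + b + c + d)
    return (b + c) / (a + b + c)   # asymmetric: negative matches d are ignored

jack = [1, 0, 1, 0, 0, 0]   # Fever, Cough, Test-1 .. Test-4
mary = [1, 0, 1, 0, 1, 0]
print(round(binary_dissimilarity(jack, mary, symmetric=False), 2))   # 0.33
```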
Nominal Variables
 A generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green
 Method 1: Simple matching
 m: # of matches, p: total # of variables

$$ d(i,j) = \frac{p - m}{p} $$

 Method 2: use a large number of binary variables
 create a new binary variable for each of the M nominal states
Ordinal Variables
 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled:
 replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
 map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$

 compute the dissimilarity using methods for interval-scaled variables
Ratio-Scaled Variables
 Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
 Methods:
 treat them like interval-scaled variables (not a good choice: the scale can be distorted)
 apply a logarithmic transformation and treat the result as interval-valued: $y_{if} = \log(x_{if})$
 treat them as continuous ordinal data and treat their ranks as interval-scaled
Variables of Mixed Types
 A database may contain all six types of variables
 symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
 One may use a weighted formula to combine their effects:

$$ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} $$

 f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and $d_{ij}^{(f)} = 1$ otherwise
 f is interval-based: use the normalized distance
 f is ordinal or ratio-scaled: compute ranks $r_{if}$, set $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled
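A hedged sketch of the weighted mixed-type formula (Gower-style, with every weight $\delta_{ij}^{(f)} = 1$; the `kinds`, `ranges`, and `max_ranks` arguments are illustrative assumptions, not a standard API):

```python
def mixed_dissimilarity(x, y, kinds, ranges, max_ranks):
    """Combine per-variable distances d_ij(f), weighting each delta_ij(f) = 1."""
    num, den = 0.0, 0.0
    for f, kind in enumerate(kinds):
        if kind in ("binary", "nominal"):
            d = 0.0 if x[f] == y[f] else 1.0
        elif kind == "interval":
            d = abs(x[f] - y[f]) / ranges[f]          # normalized distance
        else:                                         # ordinal: ranks in 1..M_f
            zx = (x[f] - 1) / (max_ranks[f] - 1)
            zy = (y[f] - 1) / (max_ranks[f] - 1)
            d = abs(zx - zy)
        num += d
        den += 1.0
    return num / den

kinds = ["nominal", "interval", "ordinal"]
print(mixed_dissimilarity(["red", 10.0, 2], ["blue", 35.0, 3],
                          kinds, ranges={1: 50.0}, max_ranks={2: 3}))   # ~0.667
```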
Vector Objects
 Vector objects: keywords in documents, gene features in micro-arrays, etc.
 Broad applications: information retrieval, biologic taxonomy, etc.
 Cosine measure
 Euclidean norm of a vector X = (x_1, x_2, ..., x_p):

$$ \|X\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2} $$

 A variant: Tanimoto coefficient
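The slide names the cosine measure without giving its formula; the standard definition is $sim(X, Y) = \frac{X \cdot Y}{\|X\|\,\|Y\|}$. A small sketch over hypothetical term-frequency vectors:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: (x . y) / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Two toy term-frequency vectors for documents:
print(round(cosine_similarity([3, 0, 2, 1], [1, 1, 2, 0]), 3))   # ~0.764
```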
Major Clustering Approaches (I)
 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using some criterion
 Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
Major Clustering Approaches (II)
 Grid-based approach:
 Based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
 Model-based:
 A model is hypothesized for each cluster, and the goal is to find the best fit of the data to the given model
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: pCluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific constraints
 Typical methods: COD, constrained clustering
Typical Alternatives to Calculate the Distance between Clusters
 Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min(t_ip, t_jq)
 Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max(t_ip, t_jq)
 Average: average distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = avg(t_ip, t_jq)
 Centroid: distance between the centroids of two clusters, i.e., dis(K_i, K_j) = dis(C_i, C_j)
 Medoid: distance between the medoids of two clusters, i.e., dis(K_i, K_j) = dis(M_i, M_j)
 Medoid: one chosen, centrally located object in the cluster
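Minimal sketches of four of these inter-cluster distances, for clusters given as lists of numeric tuples (pure Python; `math.dist` requires Python 3.8+):

```python
import math

def d(p, q):
    return math.dist(p, q)   # Euclidean distance between two points

def single_link(K1, K2):
    return min(d(p, q) for p in K1 for q in K2)

def complete_link(K1, K2):
    return max(d(p, q) for p in K1 for q in K2)

def average_link(K1, K2):
    return sum(d(p, q) for p in K1 for q in K2) / (len(K1) * len(K2))

def centroid_link(K1, K2):
    centroid = lambda K: tuple(sum(v) / len(K) for v in zip(*K))
    return d(centroid(K1), centroid(K2))

K1, K2 = [(0, 0), (1, 0)], [(4, 0), (5, 0)]
print(single_link(K1, K2), complete_link(K1, K2))   # 3.0 5.0
```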
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
 Centroid: the "middle" of a cluster:

$$ C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N} $$

 Radius: square root of the average distance from any point of the cluster to its centroid:

$$ R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}} $$

 Diameter: square root of the average mean squared distance between all pairs of points in the cluster:

$$ D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}} $$
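Direct one-dimensional translations of the three statistics above (a sketch; `points` is assumed to be a list of floats):

```python
import math

def centroid(points):
    return sum(points) / len(points)

def radius(points):
    c = centroid(points)
    return math.sqrt(sum((t - c) ** 2 for t in points) / len(points))

def diameter(points):
    n = len(points)
    total = sum((p - q) ** 2 for p in points for q in points)  # all ordered pairs
    return math.sqrt(total / (n * (n - 1)))

pts = [1.0, 2.0, 3.0, 10.0]
print(centroid(pts), round(radius(pts), 2), round(diameter(pts), 2))  # 4.0 3.54 5.77
```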
Partitioning Algorithms: Basic Concept
 Partitioning method: Construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized:

$$ \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2 $$

 Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
 Global optimum: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen'67): Each cluster is represented by the center of the cluster
 k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
 Given k, the k-means algorithm is implemented in four steps (a runnable sketch follows):
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest seed point
 Go back to Step 2; stop when no new assignments are made
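A compact sketch of these four steps for 2-D points (pure Python; the random seeding and the data are illustrative):

```python
import math, random

def kmeans(points, k, max_iters=100):
    centers = random.sample(points, k)                  # step 1: initial seeds
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                # step 3: nearest center
            i = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        new = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
               for i, c in enumerate(clusters)]         # step 2: recompute means
        if new == centers:                              # step 4: stop when stable
            break
        centers = new
    return clusters, centers

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centers = kmeans(pts, k=2)
print(sorted(centers))
```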
The K-Means Clustering Method: Example
[Figure: K = 2. Arbitrarily choose k objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects and update the means again, repeating until the assignments no longer change.]
Comments on the K-Means Method
 Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
 Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
 Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
 Weakness
 Applicable only when the mean is defined; what about categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method
 A few variants of k-means differ in
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: k-modes (Huang'98)
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical objects
 Using a frequency-based method to update modes of clusters
 A mixture of categorical and numerical data: the k-prototype method
What Is the Problem with the K-Means Method?
 The k-means algorithm is sensitive to outliers!
 An object with an extremely large value may substantially distort the distribution of the data
 K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in a cluster

[Figure: two example scatter plots illustrating cluster reference points.]
The K-Medoids Clustering Method
 Find representative objects, called medoids, in clusters
 PAM (Partitioning Around Medoids, 1987)
 starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
 PAM works effectively for small data sets, but does not scale well for large data sets
 CLARA (Kaufmann & Rousseeuw, 1990)
 CLARANS (Ng & Han, 1994): Randomized sampling
 Focusing + spatial data structure (Ester et al., 1995)
A Typical K-Medoids Algorithm (PAM)
[Figure: K = 2. Arbitrarily choose k objects as initial medoids and assign each remaining object to the nearest medoid (total cost = 20). Randomly select a non-medoid object O_random and compute the total cost of swapping it with a medoid (total cost = 26). If the quality is improved, perform the swap. Repeat the loop until no change.]
PAM (Partitioning Around Medoids) (1987)
 PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
 Use real objects to represent the clusters (see the sketch after this list):
 1. Select k representative objects arbitrarily
 2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
 3. For each pair of i and h,
 If TC_ih < 0, i is replaced by h
 Then assign each non-selected object to the most similar representative object
 4. Repeat steps 2-3 until there is no change
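A sketch of PAM's swap loop in one dimension: the total cost of a medoid set is the sum of distances from each object to its nearest medoid, and a swap of medoid i for non-medoid h is kept only when it lowers that cost (TC_ih < 0). Names and data are illustrative:

```python
import itertools

def total_cost(points, medoids):
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])                  # step 1: arbitrary initial medoids
    improved = True
    while improved:                             # repeat until no change
        improved = False
        for i, h in itertools.product(range(k), points):
            if h in medoids:
                continue
            candidate = medoids[:i] + [h] + medoids[i + 1:]
            if total_cost(points, candidate) < total_cost(points, medoids):
                medoids = candidate             # TC_ih < 0: accept the swap
                improved = True
    return medoids

print(pam([1.0, 2.0, 3.0, 8.0, 9.0, 25.0], k=2))
```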
PAM Clustering: Total Swapping Cost
 TC_ih = Σ_j C_jih, where C_jih is the change in the cost contributed by object j when medoid i is swapped with non-medoid h
 Depending on whether j is closest to i, to h, or to some other medoid t before and after the swap, the per-object cost takes one of four forms:
 C_jih = d(j, h) - d(j, i)
 C_jih = 0
 C_jih = d(j, t) - d(j, i)
 C_jih = d(j, h) - d(j, t)

[Figure: four scatter-plot cases, one for each formula.]
What Is the Problem with PAM?
 PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
 PAM works efficiently for small data sets but does not scale well for large data sets
 O(k(n-k)^2) per iteration, where n is # of data points and k is # of clusters
 Sampling-based method: CLARA (Clustering LARge Applications)
CLARA (Clustering LARge Applications) (1990)
 CLARA (Kaufmann and Rousseeuw, 1990)
 Built into statistical analysis packages, such as S-Plus
 It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
 Strength: deals with larger data sets than PAM
 Weakness:
 Efficiency depends on the sample size
 A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
CLARANS ("Randomized" CLARA) (1994)
 CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han'94)
 CLARANS draws a sample of neighbors dynamically
 The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
 If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum
 It is more efficient and scalable than both PAM and CLARA
 Focusing techniques and spatial access structures may further improve its performance (Ester et al.'95)
Hierarchical Clustering
 Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition

Agglomerative (AGNES), steps 0 to 4:
  {a} {b} {c} {d} {e}  →  {a,b} {c} {d} {e}  →  {a,b} {c} {d,e}  →  {a,b} {c,d,e}  →  {a,b,c,d,e}
Divisive (DIANA), steps 4 to 0: the same sequence read right to left, splitting {a,b,c,d,e} back into singletons
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., S-Plus
 Uses the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster

[Figure: three scatter plots showing successive merges.]
Dendrogram: Shows How the Clusters are Merged
 Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
 A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
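A small sketch of AGNES-style single-link merging: start from singleton clusters and repeatedly merge the least-dissimilar pair; the recorded merges trace out the dendrogram bottom-up (pure Python, O(n^3), illustrative only):

```python
import math

def agnes_single_link(points):
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):          # find the least-dissimilar pair
            for j in range(i + 1, len(clusters)):
                dist = min(math.dist(p, q)
                           for p in clusters[i] for q in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))   # one dendrogram level
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

for a, b, dist in agnes_single_link([(0, 0), (0, 1), (5, 5), (5, 6)]):
    print(a, "+", b, "at distance", round(dist, 2))
```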
DIANA (Divisive Analysis)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., S-Plus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own

[Figure: three scatter plots showing successive splits.]
Recent Hierarchical Clustering Methods
 Major weaknesses of agglomerative clustering methods
 do not scale well: time complexity of at least O(n^2), where n is the number of total objects
 can never undo what was done previously
 Integration of hierarchical with distance-based clustering
 BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
 ROCK (1999): clustering categorical data by neighbor and link analysis
BIRCH (1996)
 BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD'96)
 Designed for clustering a large amount of numerical data by integrating hierarchical clustering (at the initial micro-clustering stage) and other clustering methods (at the later macro-clustering stage)
 Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
 Weakness: handles only numeric data, and is sensitive to the order of the data records
 It overcomes two difficulties of agglomerative clustering methods: 1) scalability, and 2) the inability to undo what was done in the previous step
BIRCH
 BIRCH introduces two concepts to summarize cluster representations: the clustering feature (CF) and the clustering feature tree (CF-tree)
 These structures help to achieve good speed and scalability in large databases and also make BIRCH effective for incremental and dynamic clustering of incoming objects
 Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
 Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Clustering Feature Vector in BIRCH
 Clustering Feature: CF = (N, LS, SS)
 N: number of data points
 LS: $\sum_{i=1}^{N} X_i$ (the linear sum of the points)
 SS: $\sum_{i=1}^{N} X_i^2$ (the square sum of the points)
 Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
 CF = (5, (16,30), (54,190))
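A sketch reproducing the CF example above; the key BIRCH property is that CFs are additive, so two subclusters can be merged by adding their CFs componentwise (SS is kept per dimension here):

```python
def clustering_feature(points):
    """CF = (N, LS, SS) for a list of d-dimensional points."""
    n = len(points)
    ls = tuple(sum(dim) for dim in zip(*points))                 # linear sum per dim
    ss = tuple(sum(v * v for v in dim) for dim in zip(*points))  # square sum per dim
    return n, ls, ss

def merge_cf(cf1, cf2):
    """CFs are additive: merge two subclusters by componentwise addition."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))   # (5, (16, 30), (54, 190))
```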
CF-Tree in BIRCH
 Clustering feature:
 summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view
 registers crucial measurements for computing clusters and utilizes storage efficiently
 A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
 A non-leaf node in the tree has descendants or "children"
 The non-leaf nodes store the sums of the CFs of their children
 A CF tree has two parameters
 Branching factor: specifies the maximum number of children
 Threshold: the maximum diameter of sub-clusters stored at the leaf nodes
The CF Tree Structure
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1, CF2, CF3, ..., CF6, each with a child pointer; a non-leaf node holds entries CF1, ..., CF5 with child pointers; leaf nodes hold CF entries and are chained together by prev/next pointers.]
Density-Based Clustering Methods
 Clustering based on density (a local cluster criterion), such as density-connected points
 Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters as a termination condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD'96)
 OPTICS: Ankerst, et al. (SIGMOD'99)
 DENCLUE: Hinneburg & Keim (KDD'98)
 CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
Density-Based Clustering: Basic Concepts
 Two parameters:
 Eps: maximum radius of the neighbourhood
 MinPts: minimum number of points in an Eps-neighbourhood of that point
 N_Eps(p): {q belongs to D | dist(p, q) <= Eps}
 Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
 p belongs to N_Eps(q)
 core point condition: |N_Eps(q)| >= MinPts
 (Example: MinPts = 5, Eps = 1 cm)
Density-Reachable and Density-Connected
 Density-reachable:
 A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p_1, ..., p_n, with p_1 = q and p_n = p, such that p_{i+1} is directly density-reachable from p_i
 Density-connected:
 A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
 Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5.]
DBSCAN: The Algorithm
 Arbitrarily select a point p
 Retrieve all points density-reachable from p w.r.t. Eps and MinPts
 If p is a core point, a cluster is formed
 If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
 Continue the process until all of the points have been processed
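A compact sketch of this procedure (pure Python with O(n^2) neighborhood queries; label -1 marks noise; a noise point reached during expansion is relabeled as a border point but never expanded):

```python
import math

def dbscan(points, eps, min_pts):
    labels = {p: None for p in points}
    cluster = 0
    for p in points:
        if labels[p] is not None:
            continue
        neighbors = [q for q in points if math.dist(p, q) <= eps]
        if len(neighbors) < min_pts:
            labels[p] = -1                    # not core: noise (for now)
            continue
        cluster += 1                          # p is a core point: new cluster
        labels[p] = cluster
        seeds = list(neighbors)
        while seeds:                          # expand via density-reachability
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster           # noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neigh = [r for r in points if math.dist(q, r) <= eps]
            if len(q_neigh) >= min_pts:       # q is also core: keep expanding
                seeds.extend(q_neigh)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=3))
```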
DBSCAN: Sensitive to Parameters
[Figure: clusterings of the same data set under different Eps and MinPts settings.]
CHAMELEON (Clustering Complex Objects)
[Figure: CHAMELEON clustering results on data sets with complex shapes.]
OPTICS: A Cluster-Ordering Method (1999)
 OPTICS: Ordering Points To Identify the Clustering Structure
 Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
 Produces a special order of the database w.r.t. its density-based clustering structure
 This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
 Good for both automatic and interactive cluster analysis, including finding the intrinsic clustering structure
 Can be represented graphically or using visualization techniques
OPTICS: Some Extensions from DBSCAN
 Index-based:
 k = number of dimensions, N = 20, p = 75%, M = N(1 - p) = 5
 Complexity: O(kN^2)
 Core distance of an object o: the smallest distance that makes o a core object
 Reachability distance of p w.r.t. o: max(core-distance(o), d(o, p))
 Example (MinPts = 5, Eps = 3 cm): r(p1, o) = 2.8 cm, r(p2, o) = 4 cm
Outlier Analysis
 What are outliers?
 A set of objects that are considerably dissimilar from the remainder of the data
 An outlier can be caused by an error, or may be the result of inherent data variability
 Outliers may be of particular interest
 Ex: In fraud detection, an outlier may indicate fraudulent activity
 Outlier detection and analysis: outlier mining
 Applications:
 Credit card fraud detection
 Telecom fraud detection
 Medical analysis
Outlier Mining
 Can be defined as:
 Given a set of n data points and k, the expected number of outliers, find the top k objects that are dissimilar, exceptional, or inconsistent w.r.t. the remaining data
 The outlier mining problem can be viewed as two sub-problems:
 define what data can be considered inconsistent in a given dataset
 find an efficient method to mine the outliers so defined
 Categorization of approaches for outlier detection: the statistical approach, the distance-based approach, the density-based local outlier approach, and the deviation-based approach
Outlier Discovery: Statistical Approaches
 Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)
 Use discordancy tests depending on
 data distribution
 distribution parameters (e.g., mean, variance)
 number of expected outliers
 Drawbacks
 most tests are for a single attribute
 in many cases, the data distribution may not be known
Outlier Discovery: Distance-Based Approach
 Introduced to counter the main limitations imposed by statistical methods
 We need multi-dimensional analysis without knowing the data distribution
 Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O
 Algorithms for mining distance-based outliers (a sketch of the nested-loop idea follows the list):
 Index-based algorithm
 Nested-loop algorithm
 Cell-based algorithm
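A sketch of the nested-loop idea under the DB(p, D) definition above (toy data; not Knorr and Ng's optimized algorithm):

```python
import math

def db_outliers(points, p, D):
    """O is a DB(p, D)-outlier if >= fraction p of the others lie beyond D."""
    outliers = []
    for o in points:
        others = [x for x in points if x is not o]
        far = sum(1 for x in others if math.dist(o, x) > D)
        if far / len(others) >= p:
            outliers.append(o)
    return outliers

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(db_outliers(pts, p=0.9, D=5.0))   # [(10, 10)]
```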
Density-Based Local Outlier Detection
 Distance-based outlier detection is based on the global distance distribution
 It encounters difficulties in identifying outliers if the data are not uniformly distributed
 Ex: C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, and there are two outlier points o1, o2
 A distance-based method cannot identify o2 as an outlier
 This motivates the concept of a local outlier
 Local outlier factor (LOF)
 Assumes the outlier notion is not crisp
 Each point is assigned a LOF
Outlier Discovery: Deviation-Based Approach
 Identifies outliers by examining the main characteristics of objects in a group
 Objects that "deviate" from this description are considered outliers
 Sequential exception technique
 simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects
 OLAP data cube technique
 uses data cubes to identify regions of anomalies in large multidimensional data
Summary
 Cluster analysis groups objects based on their similarity and has wide applications
 Measures of similarity can be computed for various types of data
 Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
 Computer-based outlier analysis methods typically follow either a statistical-distribution-based, a distance-based, a density-based, or a deviation-based approach
Problems and Challenges
 Considerable progress has been made in scalable clustering methods
 Partitioning: k-means, k-medoids, CLARANS
 Hierarchical: BIRCH, ROCK, CHAMELEON
 Density-based: DBSCAN, OPTICS, DenClue
 Grid-based: STING, WaveCluster, CLIQUE
 Model-based: EM, Cobweb, SOM
 Frequent pattern-based: pCluster
 Constraint-based: COD, constrained clustering
 Current clustering techniques do not address all the requirements adequately; this is still an active area of research
References (1)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
 P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
 F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. KDD'02.
 M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD 2000.
 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
 M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98.
References (2)
 V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering categorical data using summaries. KDD'99.
 S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
 S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. ICDE'99, pp. 512-521, Sydney, Australia, March 1999.
 A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. KDD'98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
 G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32(8):68-75, 1999.
 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
 G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.
References (3)
 L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: A review. SIGKDD Explorations, 6(1), June 2004.
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition.
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
 A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-based clustering in large databases. ICDT'01.
 A. K. H. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. ICDE'01.
 H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02.
 W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.