UNIT-6 DATAWAREHOUSING AND MINING
7. CLUSTER ANALYSIS
7.1 What Is Cluster Analysis
A cluster is a collection of data objects that are similar to one another within the same
cluster and dissimilar to the objects in other clusters.
Cluster Analysis
Finding similarities between data according to the characteristics found in the data
and grouping similar data objects into clusters
The process of grouping a set of abstract objects into classes of similar objects is
called clustering.
Clustering is an example of unsupervised learning.
Clustering is a form of learning by observation, rather than learning by
examples.
Figure 7.1 Clustering: intra-cluster distances are minimized, while inter-cluster
distances are maximized
Qualities of Good Clustering
A good clustering method will produce high quality clusters with
1. high intra-class similarity
2. low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by
the method and its implementation.
The quality of a clustering method is also measured by its ability to discover some
or all the hidden patterns.
Measuring the Quality of Clustering
Similarity/dissimilarity metric: similarity is expressed in terms of a distance
function, typically a metric d(i, j).
There is a separate 'quality' function that measures the 'goodness' of a cluster.
The definitions of distance functions are usually very different for interval-scaled,
Boolean, categorical, ordinal, ratio, and vector variables.
Weight should be associated with different variables based on applications and
data semantics.
It is hard to define 'similar enough' or 'good enough'; the answer is typically
highly subjective.
Desirable Properties of a Clustering Algorithm
Scalability, in terms of both time and space
Ability to deal with different data types
Minimal requirements for domain knowledge to determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
Incorporation of user-specified constraints
Interpretability and usability
Ability to handle high-dimensional data
Ability to handle dynamic data
Cluster Analysis Applications
Market research
Pattern recognition
Data analysis
Image processing
Information discovery on the Web
7.2 TYPE OF DATA IN CLUSTER ANALYSIS
Main memory-based clustering algorithms typically use the following two data structures:
Data matrix (or object-by-variable structure): This represents n objects, such as
persons, with p variables (also called measurements or attributes), such as age, height,
weight, gender, and so on. The structure is in the form of a relational table, or n-by-p
matrix (n objects × p variables):
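In the standard object-by-variable layout, where xif denotes the value of variable f for
object i, this matrix has the form

    x11  ...  x1f  ...  x1p
    ...       ...       ...
    xi1  ...  xif  ...  xip
    ...       ...       ...
    xn1  ...  xnf  ...  xnp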
Dissimilarity matrix (or object-by-object structure): This stores a collection of
proximities that are available for all pairs of n objects. It is often represented by an n-
by-n table:
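In the standard layout this is a lower-triangular n-by-n table, since d(i, i) = 0 and
d(i, j) = d(j, i):

    0
    d(2,1)   0
    d(3,1)   d(3,2)   0
    ...      ...      ...
    d(n,1)   d(n,2)   ...   ...   0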
where d(i, j) is the measured difference or dissimilarity between objects i and j.
The rows and columns of the data matrix represent different entities, while
those of the dissimilarity matrix represent the same entity. Thus, the data matrix is
often called a two-mode matrix, whereas the dissimilarity matrix is called a one-
mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If the data
are presented in the form of a data matrix, they can first be transformed into a
dissimilarity matrix before applying such clustering algorithms.
7.2.1 Interval-Scaled Variables
Interval-scaled variables are continuous measurements on a roughly linear scale.
This subsection describes the distance measures commonly used for computing the
dissimilarity of objects described by such variables; these measures include the
Euclidean, Manhattan, and Minkowski distances.
Standardize Data:
Calculate the mean absolute deviation:
sf = (1/n)(|x1f − mf| + |x2f − mf| + … + |xnf − mf|)
where mf = (x1f + x2f + … + xnf)/n is the mean of variable f.
Calculate the standardized measurement (z-score):
zif = (xif − mf) / sf
Using the mean absolute deviation is more robust than using the standard deviation,
since the deviations from the mean are not squared. The median absolute deviation
(MAD) is even more robust, but with it the effect of outliers disappears completely.
Given the following measurements for the variable age:
18, 22, 25, 42, 28, 43, 33, 35, 56, 28
standardize the variable as follows:
1. Compute the mean absolute deviation of age.
2. Compute the z-score for the first four measurements.
(a) Mean absolute deviation, sf, where x1f, …, xnf are the measurements of f and mf is
the mean value of f:
mf = (18 + 22 + 25 + 42 + 28 + 43 + 33 + 35 + 56 + 28) / 10 = 33
sf = (1/10)(|18 − 33| + |22 − 33| + |25 − 33| + |42 − 33| + |28 − 33| + |43 − 33| + |33 − 33|
   + |35 − 33| + |56 − 33| + |28 − 33|) = (15 + 11 + 8 + 9 + 5 + 10 + 0 + 2 + 23 + 5)/10 = 8.8
(b) z-score (standardized measurement), zif = (xif − mf) / sf:
x1 = 18: z = (18 − 33) / 8.8 = −1.70
x2 = 22: z = (22 − 33) / 8.8 = −1.25
x3 = 25: z = (25 − 33) / 8.8 = −0.91
x4 = 42: z = (42 − 33) / 8.8 = 1.02
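A minimal Python sketch that reproduces these numbers (an illustrative helper with the
age values hard-coded):

ages = [18, 22, 25, 42, 28, 43, 33, 35, 56, 28]

n = len(ages)
m_f = sum(ages) / n                            # mean of the variable: 33.0
s_f = sum(abs(x - m_f) for x in ages) / n      # mean absolute deviation: 8.8

z = [(x - m_f) / s_f for x in ages]            # standardized measurements
print(m_f, s_f)
print([round(v, 2) for v in z[:4]])            # [-1.7, -1.25, -0.91, 1.02]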
Similarity and Dissimilarity
Distances are normally used to measure the similarity or dissimilarity between
two data objects
Some popular distances are based on the Minkowski distance (or Lp norm):
d(i, j) = (|xi1 − xj1|^p + |xi2 − xj2|^p + … + |xin − xjn|^p)^(1/p)
p = 1: Manhattan distance (L1 norm): d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xin − xjn|
p = 2: Euclidean distance (L2 norm): d(i, j) = sqrt(|xi1 − xj1|^2 + |xi2 − xj2|^2 + … + |xin − xjn|^2)
Properties of a distance
d(i, j) ≥ 0 (non-negativity)
d(i, i) = 0 (the distance from an object to itself is zero)
d(i, j) = d(j, i) (symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
Example 1: Given two objects represented by the tuples (22, 1, 42, 10) and
(20, 0, 36, 8);
A) Compute the Euclidean distance between two objects.
B) Compute the Manhattan distance between the two objects.
C) Compute the Minkowski distance between the two objects using q=3.
(a) Euclidean distance:
d(i, j) = sqrt((22 − 20)^2 + (1 − 0)^2 + (42 − 36)^2 + (10 − 8)^2)
        = sqrt(4 + 1 + 36 + 4) = sqrt(45) = 6.708
(b) Manhattan distance:
d=|22-20|+|1-0|+|42-36|+|10-8|
=2+1+6+2 =11
(c) Minkowski distance (q = 3):
d(i, j) = (|22 − 20|^3 + |1 − 0|^3 + |42 − 36|^3 + |10 − 8|^3)^(1/3)
        = (8 + 1 + 216 + 8)^(1/3) = 233^(1/3) = 6.15
Example 2:
Given the 5-dimensional numeric samples A = (1, 0, 2, 5, 3) and B = (2, 1, 0, 3, −1), find
the Euclidean distance between the two points.
Euclidean distance d = sqrt((1 − 2)^2 + (0 − 1)^2 + (2 − 0)^2 + (5 − 3)^2 + (3 − (−1))^2)
                     = sqrt(1 + 1 + 4 + 4 + 16) = sqrt(26)
                     = 5.099
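A minimal Python sketch that reproduces the worked examples above (the helper name
minkowski is illustrative):

def minkowski(x, y, p):
    # Lp norm of the difference vector; p = 1 gives Manhattan, p = 2 gives Euclidean
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

i, j = (22, 1, 42, 10), (20, 0, 36, 8)
print(round(minkowski(i, j, 2), 3))   # Euclidean distance: 6.708
print(round(minkowski(i, j, 1), 3))   # Manhattan distance: 11.0
print(round(minkowski(i, j, 3), 2))   # Minkowski distance, q = 3: 6.15

a, b = (1, 0, 2, 5, 3), (2, 1, 0, 3, -1)
print(round(minkowski(a, b, 2), 3))   # Euclidean distance: 5.099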
7.2.2 Binary Variables
Contingency table for binary data, where q is the number of variables that equal 1
for both objects i and j, r the number that equal 1 for i but 0 for j, s the number that
equal 0 for i but 1 for j, and t the number that equal 0 for both; p = q + r + s + t is
the total number of variables.

                      Object j
                   1        0        sum
              1    q        r        q + r
  Object i    0    s        t        s + t
            sum   q + s    r + t     p

Figure 7.3 A contingency table for binary variables
Binary Variables
Distance measure for symmetric binary variables (symmetric binary dissimilarity):
d(i, j) = (r + s) / (q + r + s + t)
Distance measure for asymmetric binary variables (asymmetric binary dissimilarity),
where the number of negative matches, t, is dropped:
d(i, j) = (r + s) / (q + r + s)
Asymmetric binary similarity: sim(i, j) = q / (q + r + s) = 1 − d(i, j)
This is also called the Jaccard coefficient
Example:
Name Gender fever cough test-1 test-2 test-3 test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y Y N N N N
Gender is a symmetric attribute; the remaining attributes are asymmetric binary.
Let the values Y and P be set to 1, and the value N be set to 0. Suppose that the
distance between two patients is computed from the asymmetric attributes only,
using the asymmetric binary dissimilarity; the resulting values are worked out in the
sketch below.
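Plugging the counts into d(i, j) = (r + s)/(q + r + s) gives d(Jack, Mary) = 1/3 = 0.33,
d(Jack, Jim) = 2/3 = 0.67 and d(Jim, Mary) = 3/4 = 0.75, suggesting that Jim and Mary are
the least likely to have a similar disease. A minimal Python sketch (the 0/1 encoding is
taken from the table above; gender is excluded as a symmetric attribute):

def asym_binary_dissim(x, y):
    # q, r, s as in the contingency table of Figure 7.3; negative matches t are dropped
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

jack = (1, 0, 1, 0, 0, 0)   # fever, cough, test-1, test-2, test-3, test-4
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(asym_binary_dissim(jack, mary), 2))   # 0.33
print(round(asym_binary_dissim(jack, jim), 2))    # 0.67
print(round(asym_binary_dissim(jim, mary), 2))    # 0.75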
7.2.3 Nominal or Categorical Variable
A nominal (categorical) variable is a generalization of the binary variable in that it
can take on more than two states.
Method 1: simple matching
d(i, j) = (p − m) / p
where m is the number of matches (variables on which i and j are in the same state)
and p is the total number of variables.
Method 2: use a large number of binary variables
Create a new binary variable for each of the M nominal states.
Ordinal Variables
An ordinal variable can be discrete or continuous.
The order of its values is important, e.g., rank.
It can be treated like an interval-scaled variable:
replace xif by its rank rif ∈ {1, …, Mf}
map the range of each variable onto [0, 1] by replacing the i-th object in
the f-th variable by zif = (rif − 1) / (Mf − 1)
compute the dissimilarity using the methods for interval-scaled variables.
Ratio-Scaled Variables
A ratio-scaled variable is a positive measurement on a nonlinear scale,
approximately at exponential scale, such as Ae^(βt) or Ae^(−βt).
Methods:
Treat them as interval-scaled variables – not a good choice (it can distort the data)
Apply a logarithmic transformation, yif = log(xif)
Treat them as continuous ordinal data and treat their ranks as interval-valued
7.2.4 Variables of Mixed Types
A database may contain all six types of variables. A weighted formula can be used to
combine their effects:
d(i, j) = Σf δij(f) dij(f) / Σf δij(f)
where the indicator δij(f) = 0 if xif or xjf is missing, or if xif = xjf = 0 and variable f
is asymmetric binary; otherwise δij(f) = 1.
The contribution dij(f) of variable f to the dissimilarity is computed depending on its type:
If f is interval-based: dij(f) = |xif − xjf| / (maxh xhf − minh xhf),
where h runs over all nonmissing objects for variable f.
If f is binary or nominal: dij(f) = 0 if xif = xjf, and dij(f) = 1 otherwise.
If f is ordinal: compute the ranks rif and zif = (rif − 1) / (Mf − 1),
and treat zif as interval-scaled.
7.2.5 Vector Objects
Vector objects: keywords in documents, gene features in micro-arrays, etc.
Broad applications: information retrieval, biology taxonomy, etc.
Cosine measure: s(x, y) = x^t y / (||x|| ||y||)
Tanimoto coefficient: s(x, y) = x^t y / (x^t x + y^t y − x^t y)
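A minimal Python sketch of both measures (the two keyword-count vectors are hypothetical):

import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def tanimoto(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

d1 = (5, 0, 3, 0, 2)   # keyword counts for document 1 (hypothetical)
d2 = (3, 0, 2, 0, 1)   # keyword counts for document 2 (hypothetical)
print(round(cosine(d1, d2), 3))     # close to 1, so the documents are very similar
print(round(tanimoto(d1, d2), 3))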
7.3 A CATEGORIZATION OF MAJOR CLUSTERING METHODS
Partitioning Methods:
Given a database of n objects or data tuples, a partitioning method constructs k
partitions of the data, where each partition represents a cluster and k ≤ n.
It classifies the data into k groups, which together satisfy the following
requirements:
1. Each group must contain at least one object.
2. Each object must belong to exactly one group.
The general criterion of a good partitioning is that objects in the same cluster are
“close” or related to each other, whereas objects of different clusters are “far apart”
or very different.
Examples: k-means, k-medoids
1) Hierarchical Methods
A hierarchical method creates a hierarchical decomposition of the given set of
data objects, i.e., the data are not partitioned into a particular cluster in a single step.
Instead, a series of partitions takes place, which may run from a single cluster
containing all objects to n clusters, each containing a single object.
Two Main Types of Hierarchical Clustering:
Agglomerative: (Bottom-up Approach)
Start with the points as individual clusters
At each step, merge the two closest clusters, until only one cluster (or k clusters) is left
Divisive: (Top-Down Approach)
Start with one, all-inclusive cluster
At each step, split a cluster, until each cluster contains a single point (or there are k clusters)
To improve the quality of hierarchical clustering:
1. Perform careful analysis of object “linkages” at each hierarchical partitioning.
2. Integrate hierarchical agglomeration and other approaches by first using a
hierarchical agglomerative algorithm to group objects into microclusters, and
then performing macroclustering on the microclusters using another clustering
method.
Example: CURE, BIRCH, CHAMELEON
2) Density-Based Methods
Most partitioning methods cluster objects based on the distance between objects;
such methods can find only spherical-shaped clusters and encounter difficulty in
discovering clusters of arbitrary shape.
Other clustering methods have been developed based on the notion of density.
Their general idea is to continue growing the given cluster as long as the density
(number of objects or data points) in the “neighborhood” exceeds some threshold;
that is, for each data point within a given cluster, the neighborhood of a given
radius has to contain at least a minimum number of points.
Such a method can be used to filter out noise (outliers) and discover clusters of
arbitrary shape.
Example: DBSCAN, OPTICS, DENCLUE
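For illustration, scikit-learn's DBSCAN exposes exactly these two notions, the
neighborhood radius (eps) and the minimum number of points it must contain
(min_samples); the data points below are hypothetical:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 8.1], [7.9, 8.0],
              [25.0, 0.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # two dense clusters; the isolated last point gets the noise label -1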
3) Grid-Based Methods
Grid-based methods quantize the object space into a finite number of cells that
form a grid structure.
All of the clustering operations are performed on the grid structure (i.e., on the
quantized space).
The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only on the
number of cells in each dimension of the quantized space.
Example: STING, WaveCluster, CLIQUE
4) Model-Based Methods:
Model-based methods hypothesize a model for each of the clusters and find the
best fit of the data to the given model.
A model-based algorithm may locate clusters by constructing a density function
that reflects the spatial distribution of the data points.
It also leads to a way of automatically determining the number of clusters based
on standard statistics, taking “noise” or outliers into account and thus yielding
robust clustering methods.
Example: COBWEB, CLASSIT
7.4 PARTITIONING METHODS
Partitioning algorithms construct partitions of a data base of N objects into a
set of k clusters. The construction involves determining the optimal partition with
respect to an objective function. In general, partitioning algorithms are
nonhierarchical: each instance is placed in exactly one of K nonoverlapping
clusters.
Since only one set of clusters is output, the user normally has to input the desired
number of clusters K.
The Partitional techniques usually produce clusters by optimizing a criterion
function defined either locally or globally. A global criterion, such as the
Euclidean square-error measure, represents each cluster by a prototype or
centroid and assigns the samples to clusters according to the most similar
prototypes. A local criterion, such as the minimal mutual neighbor distance
(MND), forms clusters by utilizing the local structure or context in the data.
The most commonly used Partitional-clustering strategy is based on the square
error criterion. The general objective is to obtain the partition that, for a fixed number
of clusters, minimizes the total square error. Suppose that the given dataset of N
samples in an n-dimensional space has been partitioned into k clusters {C1, C2, …, Ck}.
Each cluster Ck has nk samples and each sample belongs to exactly one cluster, so that
n1 + n2 + … + nk = N.
The mean vector Mk of cluster Ck is defined as the centroid of the cluster:
Mk = (1/nk) Σ xik
where xik is the i-th sample belonging to cluster Ck.
The square-error for cluster Ck is the sum of the squared Euclidean distances
between each sample in Ck and its centroid. This error is also called the within-cluster
variation:
ek^2 = Σ (xik − Mk)^2, the sum running over the nk samples xik in Ck.
The square-error for the entire clustering space containing k clusters is the
sum of the within-cluster variations:
Ek^2 = e1^2 + e2^2 + … + ek^2
7.4.1 Classical partitioning Methods: k-Means and k-Medoids
The most well-known and commonly used partitioning methods are k-
means, k-Medoids, and their variations.
Centroid-Based Technique: The k-Means Method
K-means is one of the simplest unsupervised learning algorithms that solve
the well known clustering problem. The k-means algorithm takes the input parameter,
k, and partitions a set of n objects into k clusters so that the resulting intracluster
similarity is high but the intercluster similarity is low. Cluster similarity is measured
in regard to the mean value of the objects in a cluster, which can be viewed as the
cluster's centroid or center of gravity.
Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition
(the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to step 2; stop when there are no more new assignments
Finally, this algorithm aims at minimizing an objective function, in this case
the square-error function
E = Σi=1..k Σp∈Ci |p − mi|^2
where E is the sum of the square error for all objects in the data set; p is the point in
space representing a given object; and mi is the mean of cluster Ci (both p and mi are
multidimensional).
Figure 7.5 K-Means Clustering
Algorithm:
k-means. The k-means algorithm for partitioning, where each cluster's center is
represented by the mean value of the objects in the cluster.
Input:
K: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters
Method:
1. Select initial partitions with k clusters containing randomly chosen samples,
and compute the centroids of the clusters.
2. Generate a new partition by assigning each sample to the closest cluster
center.
3. Compute new cluster centers as the centroids of the cluster.
4. Repeat steps 2 and 3 until an optimum value of the criterion function is found or
until the cluster membership stabilizes.
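A minimal NumPy sketch of these steps (an illustrative implementation with hypothetical
sample points, not the notes' own code):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # step 1: seeds
    for _ in range(max_iter):
        # step 2: assign each object to the cluster with the nearest centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # step 3: recompute each centroid as the mean of the objects assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break                    # step 4: stop when the centroids no longer move
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
labels, centers = kmeans(X, k=2)
print(labels)   # e.g. [0 0 0 1 1 1] (the cluster numbering may differ)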
Scaling the k-means Algorithm
Identify three kinds of regions in data: regions that are compressible, regions
that must be maintained in main memory, and regions that are discardable.
If an object is neither discardable nor compressible, then it should be retained
in main memory.
To achieve scalability, the iterative clustering algorithm only includes the
clustering features of the compressible objects and the objects that must be
retained in main memory, thereby turning a secondary-memory based
algorithm into a main-memory based algorithm.
Another approach is to perform microclustering, which first groups nearby
objects into “microclusters” and then performs k-means clustering on the
microclusters.
Advantages
1. With a large number of variables, k-means may be computationally faster than
hierarchical clustering (if k is small).
2. K-means may produce tighter clusters than hierarchical clustering, especially
if the clusters are globular.
Disadvantages
Difficulty in comparing the quality of the clusters produced.
Applicable only when the mean of a cluster is defined.
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable for discovering clusters with non-convex shapes.
Representative Object-Based Technique: The k-Medoids Method
This algorithm is very similar to k-means, except that instead of computing an
artificial mean point as the cluster center, k-medoids uses the most centrally located
actual object (the medoid) in a cluster as its representative. The motivation is that a
mean is easily distorted by outliers that are extremely far away, whereas a medoid is not.
K-medoids clustering works as follows:
The initial representative objects (or seeds) are chosen arbitrarily.
The iterative process of replacing representative objects by nonrepresentative
objects continues as long as the quality of the resulting clustering is improved.
Quality is estimated using a cost function that measures the average
dissimilarity between an object and the representative object of its cluster.
To determine whether a nonrepresentative object, orandom, is a good
replacement for a current representative object, oj, the following four cases are
examined for each of the nonrepresentative objects, p.
Case 1: p currently belongs to representative object oj. If oj is replaced by orandom as a
representative object and p is closest to one of the other representative objects oi, i ≠ j,
then p is reassigned to oi.
Case 2: p currently belongs to representative object oj. If oj is replaced by orandom as a
representative object and p is closest to orandom, then p is reassigned to orandom.
Case 3: p currently belongs to representative object oi, i ≠ j. If oj is replaced by orandom
as a representative object and p is still closest to oi, then the assignment does not change.
Case 4: p currently belongs to representative object oi, i ≠ j. If oj is replaced by
orandom as a representative object and p is closest to orandom, then p is reassigned to
orandom.
PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms
introduced. It attempts to determine k partitions for n objects. After an initial random
selection of k representative objects, the algorithm repeatedly tries to make a better
choice of cluster representatives. The final set of representative objects are the
respective medoids of the clusters, i.e., the most centrally located objects.
Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on
medoid or central objects.
Input:
K: the number of clusters.
D: a data set containing n objects.
Output: A set of k clusters.
Method:
1. Arbitrarily choose k objects in D as the initial representative objects or seeds;
2. Repeat
3. Assign each remaining object to the cluster with the nearest representative
object;
4. Randomly select a nonrepresentative object, orandom;
5. Compute the total cost, S, of swapping representative object oj with orandom;
6. If S < 0 then swap oj with orandom to form the new set of k representative objects;
7. Until no change;
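The swap loop can be sketched in Python as follows (an illustrative sketch rather than
the reference PAM implementation: it takes a precomputed dissimilarity matrix D and, in
each pass, checks every possible medoid/non-medoid swap instead of a single random one):

import itertools
import random

def total_cost(D, medoids):
    # sum of the dissimilarity of every object to its nearest representative object
    return sum(min(D[i][m] for m in medoids) for i in range(len(D)))

def pam(D, k, seed=0):
    n = len(D)
    medoids = random.Random(seed).sample(range(n), k)          # arbitrary initial seeds
    cost = total_cost(D, medoids)
    improved = True
    while improved:                                            # repeat ... until no change
        improved = False
        non_medoids = [i for i in range(n) if i not in medoids]
        for o_j, o_rand in itertools.product(list(medoids), non_medoids):
            candidate = [o_rand if m == o_j else m for m in medoids]
            s = total_cost(D, candidate) - cost                # total cost S of the swap
            if s < 0:                                          # keep the cheaper configuration
                medoids, cost, improved = candidate, cost + s, True
                break
    labels = [min(medoids, key=lambda m: D[i][m]) for i in range(n)]
    return medoids, labels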
Advantages
PAM is very robust to the existence of outliers. The clusters found by this
method do not depend on the order in which the objects are examined.
PAM works effectively for small data sets.
Disadvantages
It does not scale well for large data sets.
7.4.2 Partitioning methods in large databases
From k-medoids to CLARANS
CLARA (Clustering LARge Applications) overcomes the scalability problem
that k-medoids suffers from. CLARA draws a sample of the data set and applies PAM on
this sample to determine the optimal set of medoids for the sample. It then classifies
the remaining objects using the partitioning principle. If the sample is drawn in a
sufficiently random way, the medoids of the sample approximate the medoids
of the entire data set.
CLARA algorithm:
Input: a database D of objects and the number of clusters k.
Repeat m times:
Draw a sample S ⊆ D randomly from D
Call PAM(S, k) to get k medoids
Classify the entire data set D into clusters C1, C2, …, Ck
Calculate the quality of the clustering as the average dissimilarity
End
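A sketch of this loop in Python, reusing the pam() and total_cost() helpers sketched in
Section 7.4.1 (the sample size of 40 and the m = 5 repetitions are illustrative
assumptions; D is the full n-by-n dissimilarity matrix):

import random

def clara(D, k, m=5, sample_size=40, seed=0):
    rng = random.Random(seed)
    n = len(D)
    best_medoids, best_quality = None, float('inf')
    for _ in range(m):                                        # repeat for m samples
        sample = rng.sample(range(n), min(sample_size, n))
        sub = [[D[i][j] for j in sample] for i in sample]     # dissimilarities of the sample
        med_idx, _ = pam(sub, k)                              # PAM on the sample only
        medoids = [sample[i] for i in med_idx]                # map back to the full data set
        quality = total_cost(D, medoids) / n                  # average dissimilarity over D
        if quality < best_quality:                            # keep the best clustering found
            best_medoids, best_quality = medoids, quality
    labels = [min(best_medoids, key=lambda md: D[i][md]) for i in range(n)]
    return best_medoids, labels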
To improve the quality of CLARA, we go for CLARANS (Clustering Large
Applications based upon RANdomized Search). Unlike CLARA, CLARANS does not confine
itself to a fixed sample of the database; it draws a sample of neighbors, with some
randomness, in each step of its search.
CLARANS works as follows:
CLARANS draws a sample of neighbors, with some randomness, in each step of the search.
At each step, PAM examines all the neighbors of the current node in its search
for a minimum-cost solution, whereas CLARANS examines only a sample of them. The
current node is then replaced by the neighbor with the largest descent in cost.
If a better neighbor is found (i.e., one having a lower cost), CLARANS moves to the
neighbor's node and the process starts again;
Otherwise, the current clustering produces a local minimum. If a local
minimum is found, CLARANS starts with a new randomly selected node in
search of a new local minimum.
Once a user-specified number of local minima has been found, the algorithm
outputs, as a solution, the best local minimum, that is, the local minimum
having the lowest cost.
Advantages
More efficient than k-means and k-medoids.
Can be used to detect outliers.
Disadvantages
May not find a real local minimum due to the trimming of its searching.
It assumes that all objects fit into main memory, and the result is sensitive
to the input order.
7.5 HIERARCHICAL METHOD
Basically, hierarchical methods group data into a tree of clusters. There are two
basic varieties of hierarchical algorithms: agglomerative and divisive. A tree structure
called a dendrogram is commonly used to represent the process of hierarchical
clustering.
7.5.1 Agglomerative and divisive hierarchical clustering
Agglomerative hierarchical methods (bottom-up strategy)
Begin with as many clusters as objects. Clusters are successively merged until
only one cluster remains.
Divisive hierarchical methods(top-down strategy)
Begin with all objects in one cluster. Groups are continually divided until there
are as many clusters as objects.
Steps in agglomerative hierarchical clustering
1. Compute the proximity matrix.
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters.
5. Update the proximity matrix
6. Until only a single cluster remains
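These steps are what SciPy's hierarchical-clustering routines carry out; a minimal sketch
on hypothetical 2-D points:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.8], [0.5, 1.2], [8.0, 8.0], [8.5, 8.2]])

Z = linkage(X, method='single')                   # merge the two closest clusters repeatedly
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the resulting dendrogram into 2 clusters
print(labels)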
7.5.2 BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies
BIRCH partitions objects hierarchically using tree structures and then refines
the clusters using other clustering methods. It defines a clustering feature (CF) and an
associated tree structure (the CF tree) that summarize a cluster; the CF tree is a
height-balanced tree that stores the cluster information. Because BIRCH uses the notion
of diameter to control cluster boundaries, it may not perform well when clusters are
not spherical, and it may produce unintended clusters. These structures help the clustering
method achieve good speed and scalability in large databases and also make it
effective for incremental and dynamic clustering of incoming objects.
BIRCH applies a multiphase clustering technique, a single scan of the data set
yields a basic good clustering, and one or more additional scans can (optionally) be
used to further improve the quality.
The primary phases are:
Phase 1: BIRCH scans the database to build an initial in-memory CF tree, which can
be viewed as a multilevel compression of the data that tries to preserve the inherent
clustering structure of the data.
Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of
the CF tree, which removes sparse clusters as outliers and groups dense clusters into
larger ones.
Clustering features
A clustering feature (CF) is a summary of the statistics for a given subcluster:
the 0th, 1st, and 2nd moments of the subcluster from the statistical point of
view, i.e., the number of points, their linear sum, and their square sum.
It registers the crucial measurements for computing clusters and utilizes
storage efficiently.
Figure 7.9 A CF tree (root-level entries CF1, CF2, …, CFk, each pointing to a node of
first-level entries CF11, CF12, …, and so on down to the leaves)
A CF tree is a height balanced tree that stores the clustering features for a
hierarchical clustering.
A nonleaf node in the tree has descendants or “children”.
The nonleaf nodes store the sums of the CFs of their children.
A CF tree has two parameters
Branching factor: specifies the maximum number of children per node.
Threshold: the maximum diameter of the subclusters stored at the leaf
nodes
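For illustration, scikit-learn's Birch estimator exposes these same two CF-tree
parameters; a brief sketch on hypothetical points (in scikit-learn the threshold is
interpreted as the maximum radius of a leaf subcluster):

import numpy as np
from sklearn.cluster import Birch

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.3, 8.1]])

birch = Birch(threshold=0.5,         # controls the size of the leaf subclusters
              branching_factor=50,   # maximum number of CF entries per node
              n_clusters=2)          # optional global clustering of the leaf entries
print(birch.fit_predict(X))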
Drawback
Handles only numeric data and is sensitive to the order of the data records.
7.5.3 CURE: Clustering Using Representatives
CURE creates clusters by sampling the database and shrinks them toward the center of
the cluster by a specified fraction. It is obviously better in run time but lacking in
precision. It is robust against outliers and creates clusters of differing sizes that are not
necessarily spherical. CURE fails when clustering categorical data. Ignores aggregate
interconnectivity of objects in separate clusters.
Steps:
Draw a random sample S of the original objects.
Partition the sample S into a set of partitions and form a cluster for each
partition.
Representative points are found by selecting a constant number of
points from a cluster and then “shrinking” them toward the centre of
the cluster.
Cluster similarity is the similarity of the closest pair of representative
points from different clusters
Shrinking representative points toward the center helps avoid
problems with noise and outliers
CURE is better able to handle clusters of arbitrary shapes and sizes
7.5.4 ROCK (Robust Clustering using links) : A Hierarchical Clustering
Algorithm for Categorical Attributes
Merges clusters based on their interconnectivity. Great for categorical data.
Ignores information about the looseness of two clusters while emphasizing
interconnectivity.
Steps:
Obtain a sample of points from the data set
Compute the link value for each set of points, i.e., transform the original
similarities (computed by Jaccard coefficient) into similarities that reflect the
number of shared neighbors between points
Perform an agglomerative hierarchical clustering on the data using the
“number of shared neighbors” as similarity measure and maximizing “the
shared neighbors” objective function
Assign the remaining points to the clusters that have been found
Link Measure in Rock
Links : Number of common neighbors
C1 < a, b, c, d, e> : {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c,
d}, {b, c, e}, {c, d, e}
C2 <a, b, f, g> : {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Let T1= {a, b, c}, T2={c, d, e}, T3= {a, b, f}
Link(T1, T2) = 4, since they have 4 common neighbors, i.e., {a, c, d}, {a, c, e},
{b, c, d}, and {b, c, e}.
Link(T1, T3) = 3, since they have 3 common neighbors, i.e., {a, b, d}, {a, b, e},
and {a, b, g}.
7.5.5 Chameleon: A Hierarchical Clustering Algorithm Using Dynamic
Modeling
Chameleon uses dynamic modeling. It basically constructs a sparse k-nearest-neighbor
graph, partitions the graph into many small pieces, and then merges the pieces into
clusters. It can produce more natural clusters than DBSCAN (non-spherical clusters are
possible), since the k-nearest-neighbor graph captures the neighborhood of each point
dynamically.
Features:
Adapt to the characteristics of the data set to find the natural clusters
Use a dynamic model to measure the similarity between clusters
Its main properties are the relative closeness and relative interconnectivity
of the clusters
Two clusters are combined if the resulting cluster shares certain
properties with the constituent clusters
The merging scheme preserves self-similarity
One of the areas of application is spatial data
Figure 7.10 Chameleon: hierarchical clustering based on k-nearest neighbors and
dynamic modeling.
Chameleon Steps
Preprocessing – Represent the Data by a Graph
Given a set of points, construct the k-nearest-neighbor (k-NN) graph
to capture the relationship between a point and its k nearest
neighbors
Concept of neighborhood is captured dynamically (even if region is
sparse)
Phase 1: Use a multilevel graph partitioning algorithm on the graph to find a
large number of clusters of well-connected vertices
Each cluster should contain mostly points from one “true” cluster, i.e.,
is a sub-cluster of a “real” cluster
Phase 2: Use Hierarchical Agglomerative Clustering to merge sub-clusters
Two clusters are combined if the resulting cluster shares certain
properties with the constituent clusters
Two key properties used to model cluster similarity:
Relative Interconnectivity: Absolute interconnectivity of two
clusters normalized by the internal connectivity of the clusters
Relative Closeness: Absolute closeness of two clusters
normalized by the internal closeness of the clusters
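For reference, the standard definitions from the Chameleon literature, written in the
notation used above, are:
Relative interconnectivity: RI(Ci, Cj) = |EC(Ci, Cj)| / ((|EC(Ci)| + |EC(Cj)|) / 2), where
EC(Ci, Cj) is the sum of the weights of the edges that connect Ci and Cj, and EC(Ci) is the
weight of the min-cut bisector that splits Ci into two roughly equal parts.
Relative closeness: RC(Ci, Cj) = s(Ci, Cj) / ((|Ci| s(Ci) + |Cj| s(Cj)) / (|Ci| + |Cj|)), where
s(Ci, Cj) is the average weight of the edges connecting Ci and Cj, and s(Ci) is the average
weight of the edges in the min-cut bisector of Ci.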