Lecture 8-9
Basic techniques for cluster
detection
1
Overview
Lecture 8
• The problem of cluster detection
• Measuring proximity between data objects
• The K-means cluster detection method
Lecture 9
• The agglomeration cluster detection method
• Performance issues of the basic methods
• Cluster evaluation and interpretation
• Summary
2
Lecture 8
3
Clustering
• Clustering techniques can be categorized as either hierarchical or
nonhierarchical
• Hierarchical methods: agglomerative, divisive
• Nonhierarchical methods: K-means, SOM (self-organizing maps)
4
Clustering
• Divisive hierarchical clustering starts with the whole data set in one
cluster, and then repeatedly splits it into smaller and smaller clusters until
each observation is in its own cluster.
• Agglomerative hierarchical clustering starts with each observation in its
own cluster and repeatedly merges the most similar clusters until all
observations make up one big cluster.
5
Recall: Data Mining: Problems & Patterns
• Cluster detection
– Measure similarity among data objects and group them into
clusters accordingly
[Figure: input data points are fed into a clustering method, which outputs the cluster memberships of the data points]
6
Problem of Cluster Detection
• What is cluster detection?
• Clustering is a process of discovering clusters
• Cluster: a group of objects known as members
• The centre of a cluster is known as the centroid
• Members of a cluster are similar to each other (low intra-cluster
variation)
• Members of different clusters are different from each other (high
inter-cluster dissimilarity)
[Figure: clusters of points with their centroids marked]
7
Problem of Cluster Detection
• Basic elements of a clustering solution
• A sensible measure of similarity, e.g. Euclidean distance
• A goodness-of-fit function for evaluating the quality of the
resulting clusters, e.g. SSE
• An effective and efficient clustering algorithm, e.g. K-means
[Figure: two alternative groupings of the same points ("Good or Bad?"), contrasting internal (intra-cluster) variation with inter-cluster distance]
8
Problem of Cluster Detection
• Requirements for clustering solutions
• Scalability – speed, size
• Able to deal with attributes of different data types
• Able to discover clusters of arbitrary shapes
• Minimal requirements for domain knowledge to
determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input data records
• Able to deal with high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
9
Measures of Proximity
• Basics
• Proximity between two data objects is represented by
either similarity or dissimilarity
• Similarity: a numeric measure of the degree of alikeness,
dissimilarity: numeric measure of the degree of difference
between two objects
• Similarity measure and dissimilarity measure are often
convertible; normally dissimilarity is preferred
• Measure of dissimilarity:
• Measuring the difference between values of the corresponding
attributes
• Combining the measures of the differences into one value
10
Measures of Proximity
• A distance function d(x, y) is called a metric if it satisfies:
• d(x, y) ≥ 0 for all data objects x and y, and d(x, x) = 0
• d(x, y) = d(y, x), for all data objects x and y
• d(x, y) ≤ d(x, z) + d(z, y), for all data objects x, y and z
• The difference between values of a single attribute depends directly on
the domain type of the attribute.
• A proximity function that has only some of these properties can still be
useful.
• Some measure is better than no measure at all.
11
Measures of Proximity
• Value differences for different attribute types (a code sketch follows below)
• Nominal values
• If two names are the same, the difference is 0; otherwise it is the maximum
e.g. diff("John", "John") = 0, diff("John", "Mary") = 1 (the maximum)
• The same applies to binary values
e.g. diff(Yes, No) = 1 (the maximum)
• Ordinal values
• Different degrees of proximity can be compared
e.g. diff(A, B) < diff(A, D)
• Converting ordinal values to consecutive integers
e.g. A: 5, B: 4, C: 3, D: 2, E: 1, so A – B = 1 and A – D = 3
• Interval and ratio attributes
• Measured as the absolute difference between the two numerical values
• Unknown values
• Assume the unknown value to be maximally different from any other
value, e.g. diff(NULL, v) = |v|, diff(NULL, NULL) = the maximum
12
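To make the per-attribute rules above concrete, here is a minimal Python sketch (not from the module materials) of a difference function for a single attribute. The attribute-type labels, the choice of 1.0 as the maximum difference, and the scaling of ordinal and numeric differences into [0, 1] are assumptions for illustration.

```python
def attribute_diff(a, b, attr_type, value_range=1.0, ordinal_map=None):
    """Difference between two values of one attribute, scaled so the maximum is 1.0.

    attr_type: 'nominal', 'binary', 'ordinal' or 'numeric' (assumed labels).
    value_range: max - min of a numeric attribute, used for scaling.
    ordinal_map: dict mapping ordinal values to consecutive integers, e.g. {'A': 5, 'B': 4, ...}.
    """
    # Unknown values are assumed to be maximally different from anything else
    if a is None or b is None:
        return 1.0
    if attr_type in ('nominal', 'binary'):
        return 0.0 if a == b else 1.0          # match -> 0, mismatch -> maximum
    if attr_type == 'ordinal':
        scale = max(ordinal_map.values()) - min(ordinal_map.values())
        return abs(ordinal_map[a] - ordinal_map[b]) / scale
    # interval/ratio: absolute difference, scaled by the attribute's range
    return abs(a - b) / value_range


# Example uses, mirroring the slide (ordinal differences are scaled into [0, 1] here):
grades = {'A': 5, 'B': 4, 'C': 3, 'D': 2, 'E': 1}
print(attribute_diff('John', 'Mary', 'nominal'))                # 1.0
print(attribute_diff('A', 'B', 'ordinal', ordinal_map=grades))  # 0.25  (raw difference 1)
print(attribute_diff('A', 'D', 'ordinal', ordinal_map=grades))  # 0.75  (raw difference 3)
```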
Measures of Proximity
• Ratio of mismatched features for nominal attributes
Given two data objects i and j described by p nominal attributes, let m be the
number of attributes on which the values of the two objects match. Then

RMF(i, j) = (p − m) / p, which lies in [0, 1]

Row  Body Weight  Body Height  Blood Pressure  Blood Sugar  Habit      Class
1    heavy        short        high            3            smoker     P
2    heavy        short        high            1            nonsmoker  P
3    normal       tall         normal          3            nonsmoker  N
4    heavy        tall         normal          2            smoker     N
5    low          medium       normal          2            nonsmoker  N
6    low          tall         normal          1            nonsmoker  P
7    normal       medium       high            3            smoker     P
8    low          short        high            2            smoker     P
9    heavy        tall         high            2            nonsmoker  P
10   low          medium       normal          3            smoker     P
11   heavy        medium       normal          3            nonsmoker  N

e.g. d(row1, row2) = (6 − 4)/6 = 1/3
     d(row1, row3) = (6 − 1)/6 = 5/6
(A code sketch of this calculation follows below.)
13
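A minimal Python sketch of the RMF calculation above; the row data are copied from the table and the function name is just for illustration.

```python
# Ratio of mismatched features (RMF) for objects described by nominal attributes.
def rmf(i, j):
    p = len(i)                                   # number of attributes
    m = sum(1 for a, b in zip(i, j) if a == b)   # number of matching attributes
    return (p - m) / p

# Rows 1-3 of the table above (Body Weight, Body Height, Blood Pressure,
# Blood Sugar, Habit, Class).
row1 = ("heavy", "short", "high", 3, "smoker", "P")
row2 = ("heavy", "short", "high", 1, "nonsmoker", "P")
row3 = ("normal", "tall", "normal", 3, "nonsmoker", "N")

print(rmf(row1, row2))   # 0.333... = (6 - 4) / 6
print(rmf(row1, row3))   # 0.833... = (6 - 1) / 6
```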
Measures of Proximity
Distance functions for interval/ratio attributes
• Minkowski function
d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q)^(1/q)
where x_ik is the kth attribute value of object i and p is the number of attributes.
Special cases:
➢ Manhattan distance (q = 1): d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
➢ Euclidean distance (q = 2): d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)
➢ Supremum/Chebyshev distance (q = ∞): d(i, j) = max over t = 1…p of |x_it − x_jt|
14
Measures of Proximity
Distance functions for interval/ratio attributes
customerID  No of Trans  Revenue  Tenure (Months)
101         30           1000     20
102         40           400      30
103         35           300      30
104         20           1000     35
105         50           500      1
106         80           100      10
107         10           1000     2

d1(cust101, cust102) = |30 − 40| + |1000 − 400| + |20 − 30| = 620                (Manhattan)
d2(cust101, cust102) = sqrt((30 − 40)² + (1000 − 400)² + (20 − 30)²) ≈ 600.17    (Euclidean)
dmax(cust101, cust102) = |1000 − 400| = 600                                      (Chebyshev)

[Figure: customers plotted on Revenue vs. No. of Trans axes, illustrating the Manhattan, Euclidean and Chebyshev distances between customers 101 and 102]
(A code sketch of these calculations follows below.)
15
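A short Python sketch, using only the standard library, that reproduces the three distances above; the customer tuples and function names are illustrative.

```python
# Minkowski-family distances between two customers described by
# (No of Trans, Revenue, Tenure).
cust101 = (30, 1000, 20)
cust102 = (40, 400, 30)

def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

print(minkowski(cust101, cust102, 1))   # Manhattan: 620
print(minkowski(cust101, cust102, 2))   # Euclidean: ~600.17
print(chebyshev(cust101, cust102))      # Chebyshev: 600
```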
Measures of Proximity
Binary attributes
• Given two data objects i and j of p binary attributes,
• f00 : the number of attributes where i is 0 and j is 0
• f01 : the number of attributes where i is 0 and j is 1
• f10 : the number of attributes where i is 1 and j is 0
• f11 : the number of attributes where i is 1 and j is 1
• Simple mismatch coefficient (SMC) for symmetric binary attributes:
SMC(i, j) = (f01 + f10) / (f00 + f01 + f10 + f11)
• Jaccard coefficient (JC), defined for asymmetric binary attributes:
JC(i, j) = (f01 + f10) / (f01 + f10 + f11)
16
Measures of Proximity
• Binary attributes (example)
DocumentID  Query  Database  Programming  Interface  User  Usability  Network  Web  GUI  HTML
d1          1      1         0            0          0     0          0        0    0    0
d2          0      1         1            0          0     0          0        0    0    0
d3          0      1         0            0          0     0          0        0    0    0

SMC(d1, d2) = (f01 + f10) / (f00 + f01 + f10 + f11) = (1 + 1) / (7 + 1 + 1 + 1) = 2/10
JC(d1, d2) = (f01 + f10) / (f01 + f10 + f11) = (1 + 1) / (1 + 1 + 1) = 2/3
SMC suggests d1 and d2 are not that different; JC shows they are very different: two of the three words used differ.

SMC(d1, d3) = (0 + 1) / (8 + 0 + 1 + 1) = 1/10
JC(d1, d3) = (0 + 1) / (0 + 1 + 1) = 1/2
SMC suggests d1 and d3 are very similar; JC shows they are still quite different: one of the two words used differs.
(A code sketch of these calculations follows below.)
17
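A small Python sketch of the SMC and Jaccard (dissimilarity) coefficients defined above, applied to the document vectors d1-d3; the function names are illustrative.

```python
# SMC and Jaccard dissimilarity for binary vectors.
def binary_counts(x, y):
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    return f00, f01, f10, f11

def smc(x, y):
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f01 + f10) / (f00 + f01 + f10 + f11)

def jc(x, y):
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f01 + f10) / (f01 + f10 + f11)

d1 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
d2 = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
d3 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(smc(d1, d2), jc(d1, d2))   # 0.2  0.666...
print(smc(d1, d3), jc(d1, d3))   # 0.1  0.5
```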
Measures of Proximity
• Cosine similarity function
• Treating two data objects as vectors
• Similarity is measured as the cosine of the angle θ between the two vectors
• Similarity is 1 when θ = 0° and 0 when θ = 90°
• Similarity function:
cos(i, j) = (i • j) / (||i|| ||j||)
where i • j = i1 j1 + i2 j2 + ... + in jn and ||i|| = sqrt(i1² + i2² + ... + in²)
18
Measures of Proximity
• Cosine similarity function (example)
Given two data objects:
x = (3, 2, 0, 5), and y = (1, 0, 0, 0)
Since,
x • y = 3×1 + 2×0 + 0×0 + 5×0 = 3
||x|| = sqrt(3² + 2² + 0² + 5²) ≈ 6.16
||y|| = sqrt(1² + 0² + 0² + 0²) = 1
Then, the similarity between x and y: cos(x, y) = 3 / (6.16 × 1) ≈ 0.49
The dissimilarity between x and y: 1 − cos(x, y) ≈ 0.51
(A code sketch follows below.)
19
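A minimal Python sketch of the cosine calculation above; x and y are the example vectors and the function name is illustrative.

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = [3, 2, 0, 5]
y = [1, 0, 0, 0]
sim = cosine_similarity(x, y)
print(sim)        # ~0.49 (similarity)
print(1 - sim)    # ~0.51 (dissimilarity)
```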
Measures of Proximity
• Combining heterogeneous attributes
• Based on the principle of the ratio of mismatched features
• For the kth attribute, compute the dissimilarity d_k in [0, 1]
• Set the indicator variable δ_k as follows:
• δ_k = 0, if the kth attribute is an asymmetric binary attribute and both
objects have value 0 for that attribute
• δ_k = 1, otherwise
• Compute the overall distance between i and j as:
d(i, j) = (δ_1 d_1 + δ_2 d_2 + ... + δ_n d_n) / (δ_1 + δ_2 + ... + δ_n)
20
Measures of Proximity
• Combining heterogeneous attributes (example)
SubjectID  Body Height (cm)  Body Weight (KG)  Blood Pressure  Habit       Blood Sugar
S1         125               61                High            Smoker      High
S2         178               90                Normal          Nonsmoker   Low

• D(S1, S2) = [(1 × 0.8833) + (1 × 0.4603) + (1 × 1) + (1 × 1) + (1 × 1)] / (1 + 1 + 1 + 1 + 1)
= 4.3437 / 5
≈ 0.8687
(A code sketch follows below.)
Read the chapter for more explanation on the data and calculation.
21
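A Python sketch of the combined distance above. The normalizing ranges for height and weight are not shown on the slide (they come from the full data set in the chapter); the values 60 cm and 63 kg used here are assumptions chosen because they reproduce the slide's per-attribute dissimilarities of 0.8833 and 0.4603.

```python
# Combined (Gower-style) distance over heterogeneous attributes.
# Each attribute contributes a dissimilarity d_k in [0, 1] and an indicator delta_k.

HEIGHT_RANGE = 60.0   # assumed max - min of Body Height in the full data set
WEIGHT_RANGE = 63.0   # assumed max - min of Body Weight in the full data set

def combined_distance(s1, s2):
    height1, weight1, bp1, habit1, sugar1 = s1
    height2, weight2, bp2, habit2, sugar2 = s2
    terms = [
        abs(height1 - height2) / HEIGHT_RANGE,   # numeric, range-normalized
        abs(weight1 - weight2) / WEIGHT_RANGE,   # numeric, range-normalized
        0.0 if bp1 == bp2 else 1.0,              # nominal
        0.0 if habit1 == habit2 else 1.0,        # nominal
        0.0 if sugar1 == sugar2 else 1.0,        # ordinal, treated as match/mismatch here
    ]
    deltas = [1] * len(terms)                    # all attributes applicable for S1, S2
    return sum(d * w for d, w in zip(terms, deltas)) / sum(deltas)

S1 = (125, 61, "High", "Smoker", "High")
S2 = (178, 90, "Normal", "Nonsmoker", "Low")
print(combined_distance(S1, S2))   # ~0.8687
```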
K-means, a Basic Clustering Method
• Outline of main steps
1. Define the number of clusters (k)
2. Choose k data objects at random to serve as the initial centroids of
the k clusters
3. Assign each data object to the cluster represented by its nearest
centroid
4. Find a new centroid for each cluster by calculating the mean vector
of its members
5. Undo the memberships of all data objects, go back to Step 3, and
repeat the process until cluster membership no longer changes or a
maximum number of iterations is reached
(A code sketch of these steps follows below.)
22
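A compact Python/NumPy sketch of the five steps above (a plain-vanilla K-means, not the implementation used in the module); the function name, random seed and example data layout are my own choices for illustration.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Basic K-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k data objects at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for iteration in range(max_iters):
        # Step 3: assign each object to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when memberships no longer change
        if iteration > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean vector of its members
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

# Example: the body height/weight data used later in the lecture
X = np.array([[125, 61], [178, 90], [178, 92], [180, 83], [167, 85], [170, 89],
              [173, 98], [135, 40], [120, 35], [145, 70], [125, 50]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)      # cluster membership of each subject (depends on the random initial centroids)
print(centroids)   # with this well-separated data, typically ~(174.3, 89.5) and ~(130.0, 51.2)
```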
K-means, a Basic Clustering Method
• Illustration of the method:
23
K-means, a Basic Clustering Method
• Strengths & weaknesses
• Strengths
• Simple and easy to implement
• Quite efficient
• Weaknesses
• Need to specify the value of k, but we may not know what the
value should be beforehand
• Sensitive to the choice of initial k centroids: the result can be
non-deterministic
• Sensitive to noise
• Applicable only when the mean is meaningful for the given data set
24
K-means, a Basic Clustering Method
• Overcoming the weaknesses:
• Using cluster quality to determine the value of k
• Improving how the initial k centroids are chosen
• Running the clustering a number of times and selecting the result
with the highest quality
• Finding centres that are farther apart
• Dealing with noise
• Removing outliers before clustering
25
K-means, a Basic Clustering Method
• Value of k and cluster quality
[Figure: scree plot of cluster error (e.g. SSE) against the number of clusters; the error drops sharply and then levels off, and the 'elbow' suggests a suitable value of k]
(A code sketch follows below.)
26
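A sketch of how such a scree plot can be produced, assuming scikit-learn and matplotlib are available (neither is prescribed by the lecture); SSE here is the K-means inertia reported by the library, and the data are the body height/weight subjects used in the evaluation examples.

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X = np.array([[125, 61], [178, 90], [178, 92], [180, 83], [167, 85], [170, 89],
              [173, 98], [135, 40], [120, 35], [145, 70], [125, 50]], dtype=float)

ks = range(1, 7)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)   # sum of squared distances to the nearest centroid

plt.plot(list(ks), sse, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Cluster error (SSE)")
plt.title("Scree plot for choosing k")
plt.show()
```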
Lecture 9
27
Hierarchical Clustering
Create a hierarchy of clusters for cluster analysis.
• Two different approaches
• Divisive approach
• Develops the hierarchy from
the root to the leaves by
splitting bigger clusters into
smaller ones (similar to
bisecting K-means)
• Agglomerative approach
• Develops the hierarchy from
the leaf nodes to the root by
merging smaller clusters into
bigger ones.
28
Hierarchical Clustering
• Agglomerative Approach
• Take all n data objects as individual clusters and build an n × n
dissimilarity matrix that stores the distance between every pair of
data objects
• While the number of clusters > 1 do:
• Find the pair of data objects/clusters with the minimum distance
• Merge the two data objects/clusters into a bigger cluster
• Replace the entries in the matrix for the original clusters or
objects with the cluster tag of the newly formed cluster
• Re-calculate the relevant distances and update the matrix
(A code sketch of this procedure follows below.)
29
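A plain Python/NumPy sketch of the procedure above, using the single-link (minimum distance) rule described later in the lecture. For simplicity it recomputes cluster-to-cluster distances from the original matrix instead of shrinking the matrix in place, and it records each merge rather than drawing a dendrogram; all names and the example points are illustrative.

```python
import numpy as np

def single_link_agglomeration(X):
    """Merge clusters one pair at a time, always joining the closest pair (single link)."""
    n = len(X)
    # n x n dissimilarity matrix of Euclidean distances between all data objects
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = {i: [i] for i in range(n)}   # cluster tag -> member indices
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the minimum single-link distance
        tags = list(clusters)
        best = None
        for a in range(len(tags)):
            for b in range(a + 1, len(tags)):
                ca, cb = clusters[tags[a]], clusters[tags[b]]
                d = min(D[i, j] for i in ca for j in cb)  # single link: closest members
                if best is None or d < best[0]:
                    best = (d, tags[a], tags[b])
        d, ta, tb = best
        # Merge the two clusters under the first cluster's tag and record the step
        clusters[ta] = clusters[ta] + clusters.pop(tb)
        merges.append((ta, tb, d))
    return merges

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
for ta, tb, d in single_link_agglomeration(X):
    print(f"merge cluster {tb} into cluster {ta} at distance {d:.2f}")
```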
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
• A tree-like diagram that records the sequence of merges or splits
[Figure: six points grouped into nested clusters, alongside the corresponding dendrogram recording the order and the distances (from 0 to about 0.2) at which the merges occur]
30
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
• Any desired number of clusters can be obtained by ‘cutting’ the
dendrogram at the proper level
• They may correspond to meaningful taxonomies
• Example in biological sciences (e.g., animal kingdom, phylogeny
reconstruction, …)
31
The Agglomeration Method
• Illustration of the method
32
The Agglomeration Method
• Agglomeration approaches for
merging clusters
• Single link: the distance
between two closest points
• Complete link: the distance
between two farthest points
• Group average: the average of
all pair-wise distances
• Centroids: the distance between
the centroids
33
Single Link Agglomeration Method
• Similarity of two clusters is based on the two most similar (closest)
points in the different clusters
• Determined by one pair of points, i.e., by one link in the proximity graph.
34
Single Link Agglomeration Method (cont)
After s2 and s3 are merged, the distance from any other object to the new
cluster {s2, s3} is the minimum of its distances to s2 and to s3.
35
The Agglomeration Method
• Strengths and weaknesses
• Strengths
• Deterministic results
• Produces multiple alternative clusterings (one per level of the hierarchy)
• No need to specify the value of k beforehand
• Can create clusters of arbitrary shapes (single link)
• Weaknesses
• Does not scale up to large data sets
• Merges cannot be undone, unlike membership reassignment in K-means
36
Cluster Evaluation
• Cluster quality
• Principle:
• Low level of variation among members within a cluster (intra-cluster)
• High level of dissimilarity between clusters (inter-cluster)
• The measures
• Cohesion: sum of squared errors (SSE), within-cluster variation (wc),
sum of within-cluster variation (WC)
• Separation: sum of squared distances between cluster centroids
(BC)
• Combining the cohesion and separation, the ratio BC/WC is a good
indicator of overall quality (Q)
37
Cohesion
• Within-cluster variation: the average squared distance between the members
of a cluster and its centroid

SubjectID  Body Height  Body Weight  Cluster Tag
s1         125          61           2
s2         178          90           1
s3         178          92           1
s4         180          83           1
s5         167          85           1
s6         170          89           1
s7         173          98           1
s8         135          40           2
s9         120          35           2
s10        145          70           2
s11        125          50           2

C_k: cluster k, r_k: centroid of C_k

SSE(C_k) = Σ_{x ∈ C_k} d(x, r_k)²      (sum of squared errors)
wc(C_k) = SSE(C_k) / |C_k|             (within-cluster variation)
WC = Σ_{k=1}^{K} wc(C_k)               (total within-cluster variation)

Centroid of cluster 1: (174.33, 89.5); centroid of cluster 2: (130, 51.2)
C1 is a better quality cluster than C2: wc(C1) ≈ 45.806, wc(C2) ≈ 247.760,
so WC = 45.806 + 247.760 = 293.57
(A code sketch of these calculations follows below.)
38
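A Python/NumPy sketch that reproduces the cohesion numbers above; the variable and function names are illustrative.

```python
import numpy as np

# Body height/weight data with the cluster tags from the table above.
data = {
    1: np.array([[178, 90], [178, 92], [180, 83], [167, 85], [170, 89], [173, 98]], dtype=float),
    2: np.array([[125, 61], [135, 40], [120, 35], [145, 70], [125, 50]], dtype=float),
}

def wc_of_cluster(members):
    centroid = members.mean(axis=0)
    sse = np.sum(np.linalg.norm(members - centroid, axis=1) ** 2)  # SSE(Ck)
    return centroid, sse / len(members)                            # wc(Ck) = SSE / |Ck|

WC = 0.0
for tag, members in data.items():
    centroid, wc = wc_of_cluster(members)
    WC += wc
    print(tag, centroid, round(wc, 3))   # cluster 1: ~45.806, cluster 2: ~247.760

print(round(WC, 2))                      # total within-cluster variation: ~293.57
```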
Separation Measure
• If the cluster centroids are very dissimilar from each other, cluster quality
is higher.
• This can be measured using the distance between the cluster centroids:

BC = Σ_{1 ≤ j < k ≤ K} d(r_j, r_k)²

where r_j and r_k are the centroids of clusters C_j and C_k respectively, and
d is the Euclidean distance.

(Same subject data as in the cohesion example.)
Centroid of cluster 1: (174.33, 89.5); centroid of cluster 2: (130, 51.2)
39
Overall cluster quality
• Combining the cohesion and separation, the ratio BC/WC is a good
indicator of overall quality (Q):

Q = BC / WC

The larger the value of this ratio, the better the cluster quality.
A value close to 1 indicates poor-quality clusters: the distance between
clusters is close to the distance within clusters.
(Same subject data and centroids as in the cohesion and separation examples; a code sketch of BC and Q follows below.)
40
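A self-contained Python/NumPy sketch computing BC and the overall quality Q = BC/WC for the two example clusters; the printed values are my own calculation from this data, not figures given on the slides.

```python
import numpy as np
from itertools import combinations

# Clusters from the body height/weight example (cluster tag -> members).
clusters = {
    1: np.array([[178, 90], [178, 92], [180, 83], [167, 85], [170, 89], [173, 98]], dtype=float),
    2: np.array([[125, 61], [135, 40], [120, 35], [145, 70], [125, 50]], dtype=float),
}

centroids = {tag: m.mean(axis=0) for tag, m in clusters.items()}

# Cohesion: total within-cluster variation WC
WC = sum(np.sum(np.linalg.norm(m - centroids[tag], axis=1) ** 2) / len(m)
         for tag, m in clusters.items())

# Separation: BC = sum of squared distances between all pairs of centroids
BC = sum(np.linalg.norm(centroids[a] - centroids[b]) ** 2
         for a, b in combinations(centroids, 2))

Q = BC / WC
print(round(BC, 1), round(WC, 2), round(Q, 2))   # roughly 3432.3, 293.57, 11.69
```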
Cluster Evaluation & Interpretation
• Using cluster quality for clustering
• With K-means:
• Add an outer loop for different values of K (from low to high)
• At an iteration, conduct K-means clustering using the current K
• Measure the overall cluster quality and decide whether the
resulting cluster quality is acceptable
• If not, increase the value of K by 1 and repeat the process
• With agglomeration:
• Traverse the hierarchy level by level from the root
• At a level, evaluate the overall quality of clusters
• If the quality is acceptable, take the clusters at the level as the final
result. If not, move to the next level and repeat the process.
41
Summary
• A clustering solution must provide a sensible proximity function, effective
algorithm and a cluster evaluation function
• Proximity is normally measured by a distance function that combines
measures of value differences upon attributes
• The K-Means method continues to refine prototype partitions until
membership changes no longer occur
• The agglomeration method constructs all possible groupings of individual
data objects into a hierarchy of clusters
• Good clustering results mean high similarity among members of a cluster
and low similarity between members of different clusters
42
References
• Read Chapter 4 of Data Mining Techniques and Applications
Useful further references
• Tan, P-N., Steinbach, M. and Kumar, V. (2006), Introduction to Data Mining,
Addison-Wesley, Chapters 2 (section 2.4) and 8.
43