K Means Clustering Lecture

Clustering is the process of grouping unlabeled data points into clusters so that objects within the same cluster are more similar to each other than objects in different clusters. K-means clustering is a commonly used partitioning clustering algorithm that groups data points into k number of clusters defined by the user. It works by assigning each data point to the nearest cluster centroid, recalculating the centroid positions, and repeating this process until the centroids are stable or the maximum number of iterations is reached.


Clustering

Introduction

• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms

Examples of Clustering

• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Financial applications: We might wish to find clusters of companies that have similar financial performance
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• Medical applications: We might wish to find clusters of patients with similar symptoms
• Document retrieval: We might wish to find clusters of documents with related content

Good Clustering

• A good clustering method will produce clusters with
– High intra-class similarity
– Low inter-class similarity
• A precise definition of clustering quality is difficult
– Application-dependent
– Ultimately subjective

Major Clustering Approaches

• Partitioning: Construct various partitions and then evaluate them by some criterion. Examples: k-means, k-medoids
• Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion. Example: agglomerative clustering
• Model-based: Hypothesize a model for each cluster and find the best fit of models to data. Example: Expectation Maximization
• Density-based: Guided by connectivity and density functions. Examples: DBSCAN, OPTICS, DenClue

Partitioning Algorithms

• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
– Global optimum: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen, 1967): Each cluster is represented by the center of the cluster
– k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster

K means clustering, The concept of Center (Centroid)

Assuming that we are using Euclidean distance or something similar as a measure, we can define the centroid of a cluster to be the point for which each attribute value is the average of the values of the corresponding attribute for all the points in the cluster.

So the centroid of four points (each with 6 attributes) is the point whose value for each attribute is the mean of the four points' values for that attribute.

[Table: four example points with 6 attributes and their computed centroid, not reproduced here]
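To make the idea concrete, here is a minimal NumPy sketch of the centroid computation. The original table of four points is not reproduced in this extract, so the values below are made up purely for illustration.

```python
import numpy as np

# Four illustrative points, each with 6 attributes (made-up values; the
# original table from the slide is not reproduced in this extract).
points = np.array([
    [8.0,  7.2, 0.3, 23.1, 11.1, 6.2],
    [2.0,  3.4, 0.8, 24.5,  6.2, 7.3],
    [6.0, 15.2, 0.2, 20.7, 12.5, 4.5],
    [4.0,  2.2, 0.5, 22.3, 14.6, 8.1],
])

# The centroid is simply the per-attribute (column-wise) mean.
centroid = points.mean(axis=0)
print(centroid)   # one value per attribute
```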

K means clustering, The concept of Center (Centroid)

• The centroid of a cluster will sometimes be one of the points in the cluster, but frequently, as in the above example, it will be an ‘imaginary’ point, not part of the cluster itself, which we can take as marking its center.

K-Means Clustering

• Given k, the k-means algorithm consists of four steps:
– Select initial centroids at random.
– Assign each object to the cluster with the nearest centroid.
– Compute each centroid as the mean of the objects assigned to it.
– Repeat the previous two steps until no assignment changes.
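The lecture gives no code, but the four steps translate directly into a short NumPy sketch; the function and parameter names below are our own, not part of the lecture.

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """A minimal k-means sketch; points is an (n, d) NumPy array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of the points assigned to it
        # (this sketch assumes no cluster ever becomes empty).
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids are stable.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

In practice one would usually call a library implementation such as scikit-learn's KMeans, which also handles empty clusters and smarter initialization (k-means++).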

K-Means Clustering
Distance Measurement

In this example the distance between two objects is measured with the Euclidean distance: for points p = (p1, …, pd) and q = (q1, …, qd),

d(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pd − qd)²)
Example: K-Means Clustering

• We will illustrate the k-means algorithm by using it to cluster the 16 objects with two attributes x and y, as shown.

[Table: the x and y values of the 16 objects, not reproduced here]

• These points are shown in a two-dimensional plane on the next slide.

Example: K-Means Clustering

[Figure: scatter plot of the 16 points in the x-y plane, with the three chosen initial centroids marked]
Example: K-Means Clustering

Three of the points shown in the table have been surrounded by small circles. We will assume that we have chosen k = 3 and that these three points have been selected to be the locations of the initial three centroids. This initial (fairly arbitrary) choice is shown in the figure on the previous slide.

Example: K-Means Clustering

[Table: the 16 points with columns d1, d2 and d3 (distance to each initial centroid) and the cluster each point is assigned to]
Example: K-Means Clustering

The columns headed d1, d2 and d3 in this table show the Euclidean distance of each of the 16 points from the three centroids.

The column headed ‘cluster’ indicates the centroid closest to each point and thus the cluster to which it should be assigned.
Example: K-Means Clustering

The resulting clusters are shown below. The centroids used were actual points within the clusters, but they were not the true centroids of the clusters that resulted.

[Figure: the three clusters produced by the first assignment step]

Example: K-Means Clustering

We next calculate the centroids of the three clusters using the x and y values of the objects currently assigned to each one.

The three centroids have all been moved by the assignment process, but the movement of the third one is appreciably less than for the other two.
Example: K-Means Clustering

The next step is to reassign the 16 objects to the three clusters by determining which centroid is closest to each one. This gives the revised set of clusters shown below. However, the new centroids are not real ones (not actual data points). The object at (8.3, 6.9) has moved from cluster 2 to cluster 1.

[Figure: the revised clusters after the second assignment step]

Example: K-Means Clustering

We next recalculate the positions of the three centroids, giving the set of new centroids shown below. The first two centroids have moved a little, but the third has not moved at all.

[Table: the recalculated coordinates of the three centroids]

Example: K-Means Clustering

The next step is to reassign the 16 objects to the three clusters once again. These are the same clusters as before, so their centroids will be the same as those from which the clusters were generated. Hence the stopping criterion has been met.

Example: K-Means Clustering

The three clusters for the initially (randomly) chosen three centroids have been formed.

It is clear that the formation of the clusters depends heavily on the number of centroids as well as on the initial choice of the centroids.

Choosing the best possible k

k-means has no built-in preference for the right number of clusters; the following are some common ways in which k can be selected.

1. Domain knowledge – If the problem requires or prefers a certain number or range of clusters, that can be used to select k. For instance, a business may prefer three customer segments (high/medium/low value).

2. Rule of thumb – A very rough rule of thumb is k ≈ √(n/2), where n is the number of data points, but in practice this is rarely useful. For n = 100, for example, this rule gives k ≈ 7.

Choosing the best possible k

3. Cluster quality using the Silhouette Coefficient

The silhouette coefficient is a measure of the compactness and separation of the clusters. It increases as the quality of the clusters increases; it is large for compact clusters that are far from each other and small for large, overlapping clusters.

The silhouette coefficient is calculated per instance; for a set of instances, it is calculated as the mean of the individual samples' scores. The silhouette coefficient for an instance is calculated with the following equation:

Silhouette Coefficient

Cluster quality using the Silhouette Coefficient: For the ith object in a cluster A, calculate its average distance to all the other objects in its cluster. This gives us ai.

For the ith object in cluster A and any other cluster B, calculate the mean (average) distance from the ith object in cluster A to all the objects in cluster B. Find the minimum such value with respect to all the other clusters; call this bi.

The silhouette coefficient for the ith object is then

Si = (bi − ai) / max(ai, bi)

An average of all the silhouette coefficients gives the quality of the clustering.
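As a hedged illustration of the definition above, the per-object scores and their mean can be computed directly; this sketch assumes Euclidean distance, at least two clusters, and no singleton clusters, and all names are our own.

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette coefficient computed from the definition above."""
    n = len(points)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a_i = d[i, own].mean()              # ai: mean distance within own cluster
        b_i = min(d[i, labels == c].mean()  # bi: smallest mean distance to
                  for c in set(labels.tolist()) if c != labels[i])  # any other cluster
        scores.append((b_i - a_i) / max(a_i, b_i))
    return float(np.mean(scores))
```

In practice one would normally call a library routine such as scikit-learn's sklearn.metrics.silhouette_score rather than this hand-rolled version.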
Choosing the best possible k

Elbow Method (using the Within-Cluster Sum of Squares)

1. Run the clustering algorithm (e.g., k-means) for different values of k, for instance varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of squares (WCSS).
3. Plot the curve of WCSS against the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters; a sketch of this procedure follows.
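A minimal sketch of the elbow method, assuming `points` is the (n, d) data array and reusing the k_means sketch from the earlier slide (scikit-learn's KMeans, whose inertia_ attribute is the WCSS, would work equally well):

```python
import matplotlib.pyplot as plt

ks = range(1, 11)
wcss = []
for k in ks:
    centroids, labels = k_means(points, k)  # k_means: the earlier sketch
    # Total within-cluster sum of squared distances to the assigned centroid.
    wcss.append(sum(((points[labels == j] - c) ** 2).sum()
                    for j, c in enumerate(centroids)))

plt.plot(ks, wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS")
plt.show()
```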

Choosing the best possible k

Elbow Method (using the Within-Cluster Sum of Squares)

[Figure: WCSS plotted against the number of clusters k; the bend (elbow) marks the suggested k]
Advantages of K-Means Clustering

k-means clustering is popular and widely adopted due to its simplicity and ease of implementation.

It is also efficient, with time complexity O(ikn), where n is the number of data points, k is the number of clusters, and i is the number of iterations.

Disadvantages of K-Means Clustering

The value of k is always a user input.

The algorithm is applicable only when the mean is defined; in the case of categorical data, the centroids must instead be taken to be the most frequent values (modes).

The clusters identified are very sensitive to the initially chosen centers.

k-means is very sensitive to outliers.

K-means variations

• k-medoids – instead of the mean, use the median of each cluster
– Mean of 1, 3, 5, 7, 9 is 5
– Mean of 1, 3, 5, 7, 1009 is 205
– Median of 1, 3, 5, 7, 1009 is 5
– Median advantage: not affected by extreme values
k-Medoids
k-Medoids Algorithm
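The algorithm slides themselves are not reproduced in this extract. As a rough sketch only of the PAM idea (start from k medoids, then greedily swap a medoid with a non-medoid whenever the swap lowers the total distance cost), assuming Euclidean distance, with all names our own:

```python
import numpy as np

def pam(points, k, seed=0):
    """Naive PAM-style k-medoids sketch; fine for small data sets only."""
    rng = np.random.default_rng(seed)
    n = len(points)
    # Full pairwise distance matrix (acceptable here because PAM is
    # intended for small data sets anyway).
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    medoids = list(rng.choice(n, size=k, replace=False))

    def cost(ms):
        # Total distance of every point to its nearest medoid.
        return d[:, ms].min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                # Try swapping medoid mi for non-medoid h; keep it if it helps.
                candidate = medoids[:mi] + [h] + medoids[mi + 1:]
                if cost(candidate) < cost(medoids):
                    medoids = candidate
                    improved = True
    labels = d[:, medoids].argmin(axis=1)
    return medoids, labels
```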
Problem with PAM

• PAM is more robust than k-means in the presence of noise and outliers
• PAM works efficiently for small data sets but does not scale well to large data sets
→ Sampling-based method: CLARA (Clustering LARge Applications)

CLARA (Clustering Large Applications)

• CLARA (Kaufman and Rousseeuw, 1990)
• It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
• Strength: deals with larger data sets than PAM
• Weakness:
– Efficiency depends on the sample size
– A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
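A minimal sketch of this sampling scheme, reusing the pam() sketch from the k-medoids slide above; the number of samples and the sample size below are illustrative defaults, not values prescribed by Kaufman and Rousseeuw.

```python
import numpy as np

def clara(points, k, n_samples=5, sample_size=40, seed=0):
    """CLARA sketch: PAM on random samples, judged on the FULL data set."""
    rng = np.random.default_rng(seed)
    n = len(points)

    def cost_and_labels(medoids):
        # Distances of every point to the medoids only (no full n x n matrix).
        d = np.linalg.norm(points[:, None, :] - points[medoids][None, :, :], axis=2)
        return d.min(axis=1).sum(), d.argmin(axis=1)

    best_cost, best_medoids, best_labels = np.inf, None, None
    for _ in range(n_samples):
        # Draw a random sample and cluster it with PAM.
        idx = rng.choice(n, size=min(sample_size, n), replace=False)
        sample_medoids, _ = pam(points[idx], k, seed=int(rng.integers(1 << 30)))
        medoids = [int(idx[m]) for m in sample_medoids]  # back to full-data indices
        # Evaluate the sample's medoids on all points; keep the best.
        cost, labels = cost_and_labels(medoids)
        if cost < best_cost:
            best_cost, best_medoids, best_labels = cost, medoids, labels
    return best_medoids, best_labels
```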

