K Means Clustering Lecture

Clustering is the process of grouping unlabeled data points into clusters so that objects within the same cluster are more similar to each other than objects in different clusters. K-means clustering is a commonly used partitioning clustering algorithm that groups data points into k number of clusters defined by the user. It works by assigning each data point to the nearest cluster centroid, recalculating the centroid positions, and repeating this process until the centroids are stable or the maximum number of iterations is reached.


Clustering

Introduction

• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms

Examples of Clustering

• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Financial applications: We might wish to find clusters of companies that have similar financial performance
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• Medical applications: We might wish to find clusters of patients with similar symptoms
• Document retrieval: We might wish to find clusters of documents with related content

Good Clustering

• A good clustering method will produce clusters with
– High intra-class similarity
– Low inter-class similarity
• A precise definition of clustering quality is difficult
– Application-dependent
– Ultimately subjective

Major Clustering Approaches

• Partitioning: Construct various partitions and then evaluate them by some criterion. Examples: k-means, k-medoids
• Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion. Example: agglomerative clustering
• Model-based: Hypothesize a model for each cluster and find the best fit of models to data. Example: Expectation Maximization
• Density-based: Guided by connectivity and density functions. Examples: DBSCAN, OPTICS, DenClue

Partitioning Algorithms

• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
– Global optimum: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen, 1967): Each cluster is represented by the center of the cluster
– k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster

K means clustering, The concept of Center (Centroid)

Assuming that we are using Euclidean distance or something similar as a measure, we can define the centroid of a cluster to be the point for which each attribute value is the average of the values of the corresponding attribute for all the points in the cluster.

So the centroid of four points (each with 6 attributes) is the point whose value for each attribute is the mean of the four points' values for that attribute.

[Table: four example points with 6 attributes and their computed centroid, not reproduced here]
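To make the idea concrete, here is a minimal NumPy sketch of the centroid computation. The original table of four points is not reproduced in this extract, so the values below are made up purely for illustration.

```python
import numpy as np

# Four illustrative points, each with 6 attributes (made-up values; the
# original table from the slide is not reproduced in this extract).
points = np.array([
    [8.0,  7.2, 0.3, 23.1, 11.1, 6.2],
    [2.0,  3.4, 0.8, 24.5,  6.2, 7.3],
    [6.0, 15.2, 0.2, 20.7, 12.5, 4.5],
    [4.0,  2.2, 0.5, 22.3, 14.6, 8.1],
])

# The centroid is simply the per-attribute (column-wise) mean.
centroid = points.mean(axis=0)
print(centroid)   # one value per attribute
```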

K means clustering, The concept of Center (Centroid)

• The centroid of a cluster will sometimes be one of the points in the cluster, but frequently, as in the above example, it will be an ‘imaginary’ point, not part of the cluster itself, which we can take as marking its center.

K-Means Clustering

• Given k, the k-means algorithm consists of four steps:
– Select initial centroids at random.
– Assign each object to the cluster with the nearest centroid.
– Compute each centroid as the mean of the objects assigned to it.
– Repeat the previous two steps until no assignment changes.
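The lecture gives no code, but the four steps translate directly into a short NumPy sketch; the function and parameter names below are our own, not part of the lecture.

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """A minimal k-means sketch; points is an (n, d) NumPy array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of the points assigned to it
        # (this sketch assumes no cluster ever becomes empty).
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids are stable.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

In practice one would usually call a library implementation such as scikit-learn's KMeans, which also handles empty clusters and smarter initialization (k-means++).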

K-Means Clustering
Distance Measurement

In this example the distance between two objects is measured with the Euclidean distance: for points p = (p1, …, pd) and q = (q1, …, qd),

d(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pd − qd)²)
Example: K-Means Clustering

• We will illustrate the k-means algorithm by using it to cluster the 16 objects with two attributes x and y, as shown.

[Table: the x and y values of the 16 objects, not reproduced here]

• These points are shown in a two-dimensional plane on the next slide.

Example: K-Means Clustering

[Figure: scatter plot of the 16 points in the x-y plane, with the three chosen initial centroids marked]
Example: K-Means Clustering

Three of the points shown in the table have been surrounded by small circles. We will assume that we have chosen k = 3 and that these three points have been selected to be the locations of the initial three centroids. This initial (fairly arbitrary) choice is shown in the figure on the previous slide.

Example: K-Means Clustering

[Table: the 16 points with columns d1, d2 and d3 (distance to each initial centroid) and the cluster each point is assigned to]
Example: K-Means Clustering

The columns headed d1, d2 and d3 in this table show the Euclidean distance of each of the 16 points from the three centroids.

The column headed ‘cluster’ indicates the centroid closest to each point and thus the cluster to which it should be assigned.
Example: K-Means Clustering

The resulting clusters are shown below. The centroids used were actual points within the clusters, but they were not the true centroids of the clusters that resulted.

[Figure: the three clusters produced by the first assignment step]

Example: K-Means Clustering

We next calculate the centroids of the three clusters using the x and y values of the objects currently assigned to each one.

The three centroids have all been moved by the assignment process, but the movement of the third one is appreciably less than for the other two.
Example: K-Means Clustering

The next step is to reassign the 16 objects to the three clusters by determining which centroid is closest to each one. This gives the revised set of clusters shown below. However, the new centroids are not real ones (not actual data points). The object at (8.3, 6.9) has moved from cluster 2 to cluster 1.

[Figure: the revised clusters after the second assignment step]

Example: K-Means Clustering

We next recalculate the positions of the three centroids, giving the set of new centroids shown below. The first two centroids have moved a little, but the third has not moved at all.

[Table: the recalculated coordinates of the three centroids]

Example: K-Means Clustering

The next step is to reassign the 16 objects to the three clusters once again. These are the same clusters as before, so their centroids will be the same as those from which the clusters were generated. Hence the stopping criterion has been met.

Example: K-Means Clustering

The three clusters for the initially (randomly) chosen three centroids have been formed.

It is clear that the formation of the clusters depends heavily on the number of centroids as well as on the initial choice of the centroids.

Choosing the best possible k

k-means has no built-in preference for the right number of clusters; the following are some common ways in which k can be selected.

1. Domain knowledge – If the problem requires or prefers a certain number or range of clusters, that can be used to select k. For instance, a business may prefer three customer segments (high/medium/low value).

2. Rule of thumb – A very rough rule of thumb is k ≈ √(n/2), where n is the number of data points, but in practice this is rarely useful. For n = 100, for example, this rule gives k ≈ 7.

Choosing the best possible k

3. Cluster quality using the Silhouette Coefficient

The silhouette coefficient is a measure of the compactness and separation of the clusters. It increases as the quality of the clusters increases; it is large for compact clusters that are far from each other and small for large, overlapping clusters.

The silhouette coefficient is calculated per instance; for a set of instances, it is calculated as the mean of the individual samples' scores. The silhouette coefficient for an instance is calculated with the following equation:

Silhouette Coefficient

Cluster quality using the Silhouette Coefficient: For the ith object in a cluster A, calculate its average distance to all the other objects in its cluster. This gives us ai.

For the ith object in cluster A and any other cluster B, calculate the mean (average) distance from the ith object in cluster A to all the objects in cluster B. Find the minimum such value with respect to all the other clusters; call this bi.

The silhouette coefficient for the ith object is then

Si = (bi − ai) / max(ai, bi)

An average of all the silhouette coefficients gives the quality of the clustering.
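As a hedged illustration of the definition above, the per-object scores and their mean can be computed directly; this sketch assumes Euclidean distance, at least two clusters, and no singleton clusters, and all names are our own.

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette coefficient computed from the definition above."""
    n = len(points)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a_i = d[i, own].mean()              # ai: mean distance within own cluster
        b_i = min(d[i, labels == c].mean()  # bi: smallest mean distance to
                  for c in set(labels.tolist()) if c != labels[i])  # any other cluster
        scores.append((b_i - a_i) / max(a_i, b_i))
    return float(np.mean(scores))
```

In practice one would normally call a library routine such as scikit-learn's sklearn.metrics.silhouette_score rather than this hand-rolled version.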
Choosing the best possible k

Elbow Method (using the Within-Cluster Sum of Squares)

1. Run the clustering algorithm (e.g., k-means) for different values of k, for instance varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of squares (WCSS).
3. Plot the curve of WCSS against the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters; a sketch of this procedure follows.
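A minimal sketch of the elbow method, assuming `points` is the (n, d) data array and reusing the k_means sketch from the earlier slide (scikit-learn's KMeans, whose inertia_ attribute is the WCSS, would work equally well):

```python
import matplotlib.pyplot as plt

ks = range(1, 11)
wcss = []
for k in ks:
    centroids, labels = k_means(points, k)  # k_means: the earlier sketch
    # Total within-cluster sum of squared distances to the assigned centroid.
    wcss.append(sum(((points[labels == j] - c) ** 2).sum()
                    for j, c in enumerate(centroids)))

plt.plot(ks, wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS")
plt.show()
```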

Choosing the best possible k

Elbow Method (using the Within-Cluster Sum of Squares)

[Figure: WCSS plotted against the number of clusters k; the bend (elbow) marks the suggested k]
Advantages of K-Means Clustering

k-means clustering is popular and widely adopted due to its simplicity and ease of implementation.

It is also efficient, with time complexity O(ikn), where n is the number of data points, k is the number of clusters, and i is the number of iterations.

Disadvantages of K-Means Clustering

The value of k is always a user input.

The algorithm is applicable only when the mean is defined; in the case of categorical data, the centroids must instead be taken to be the most frequent values (modes).

The clusters identified are very sensitive to the initially chosen centers.

k-means is very sensitive to outliers.

K-means variations

• k-medoids – instead of the mean, use the median of each cluster
– Mean of 1, 3, 5, 7, 9 is 5
– Mean of 1, 3, 5, 7, 1009 is 205
– Median of 1, 3, 5, 7, 1009 is 5
– Median advantage: not affected by extreme values
k-Medoids
k-Medoids Algorithm
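The algorithm slides themselves are not reproduced in this extract. As a rough sketch only of the PAM idea (start from k medoids, then greedily swap a medoid with a non-medoid whenever the swap lowers the total distance cost), assuming Euclidean distance, with all names our own:

```python
import numpy as np

def pam(points, k, seed=0):
    """Naive PAM-style k-medoids sketch; fine for small data sets only."""
    rng = np.random.default_rng(seed)
    n = len(points)
    # Full pairwise distance matrix (acceptable here because PAM is
    # intended for small data sets anyway).
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    medoids = list(rng.choice(n, size=k, replace=False))

    def cost(ms):
        # Total distance of every point to its nearest medoid.
        return d[:, ms].min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                # Try swapping medoid mi for non-medoid h; keep it if it helps.
                candidate = medoids[:mi] + [h] + medoids[mi + 1:]
                if cost(candidate) < cost(medoids):
                    medoids = candidate
                    improved = True
    labels = d[:, medoids].argmin(axis=1)
    return medoids, labels
```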
Problem with PAM

• PAM is more robust than k-means in the presence of noise and outliers
• PAM works efficiently for small data sets but does not scale well to large data sets
→ Sampling-based method: CLARA (Clustering LARge Applications)

CLARA (Clustering Large Applications)

• CLARA (Kaufman and Rousseeuw, 1990)
• It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
• Strength: deals with larger data sets than PAM
• Weakness:
– Efficiency depends on the sample size
– A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
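A minimal sketch of this sampling scheme, reusing the pam() sketch from the k-medoids slide above; the number of samples and the sample size below are illustrative defaults, not values prescribed by Kaufman and Rousseeuw.

```python
import numpy as np

def clara(points, k, n_samples=5, sample_size=40, seed=0):
    """CLARA sketch: PAM on random samples, judged on the FULL data set."""
    rng = np.random.default_rng(seed)
    n = len(points)

    def cost_and_labels(medoids):
        # Distances of every point to the medoids only (no full n x n matrix).
        d = np.linalg.norm(points[:, None, :] - points[medoids][None, :, :], axis=2)
        return d.min(axis=1).sum(), d.argmin(axis=1)

    best_cost, best_medoids, best_labels = np.inf, None, None
    for _ in range(n_samples):
        # Draw a random sample and cluster it with PAM.
        idx = rng.choice(n, size=min(sample_size, n), replace=False)
        sample_medoids, _ = pam(points[idx], k, seed=int(rng.integers(1 << 30)))
        medoids = [int(idx[m]) for m in sample_medoids]  # back to full-data indices
        # Evaluate the sample's medoids on all points; keep the best.
        cost, labels = cost_and_labels(medoids)
        if cost < best_cost:
            best_cost, best_medoids, best_labels = cost, medoids, labels
    return best_medoids, best_labels
```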

