Clustering
Machine Learning
Data we will work with
• Customer Spend Data
• AVG_Mthly_Spend: the average monthly amount spent by a customer
• No_of_Visits: the number of times the customer visited in a month
• Item Counts: counts of Apparel, Fruits and Vegetables, and Staple items purchased
• Can we cluster similar customers together?
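Below is a purely illustrative sketch of what such a dataset might look like in pandas; the column names and values are hypothetical stand-ins for the fields described above, not the actual course data.

```python
import pandas as pd

# Hypothetical sample mirroring the fields above; values are made up.
spend = pd.DataFrame({
    "AVG_Mthly_Spend": [250.0, 1200.5, 85.0, 640.0, 310.0],
    "No_of_Visits":    [3, 12, 1, 6, 4],
    "Apparel":         [1, 4, 0, 2, 1],    # item counts
    "Fruits_Veg":      [5, 20, 2, 10, 7],
    "Staples":         [3, 15, 1, 8, 5],
})
print(spend.describe())
```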
Connectivity Based: Hierarchical Clustering
• Hierarchical clustering techniques create clusters in a hierarchical, tree-like structure
• Any distance measure can be used as the measure of similarity
• The tree-like output is called a dendrogram
• Techniques either start with individual objects and sequentially combine them (agglomerative), or start from one cluster of all objects and sequentially divide it (divisive)
Agglomerative
• Starts with each object as its own cluster of one record
• Sequentially merges the two closest clusters, using distance as the measure of similarity (see the sketch below)
• How would we measure the distance between two clusters?
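A minimal sketch with scikit-learn, reusing the hypothetical `spend` frame from earlier; standardizing first matters because distance-based methods are otherwise dominated by the feature with the largest units.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Standardize so no single feature dominates the distance measure.
X = StandardScaler().fit_transform(spend)

# Bottom-up (agglomerative) clustering; Ward linkage is the default.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(labels)  # cluster id for each customer
```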
Distance between clusters
• Single linkage – minimum distance, or nearest neighbor
• Complete linkage – maximum distance, or farthest neighbor
• Average linkage – average of the distances between all pairs of points across the two clusters
• Centroid method – merge the clusters with the minimum distance between their centroids
• Ward’s method – merge the clusters whose merger gives the smallest increase in within-cluster variance
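These linkage criteria can be compared directly in SciPy; a sketch, reusing the scaled matrix `X` from the agglomerative example:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Build and plot the merge tree (dendrogram) for each linkage criterion.
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)
    plt.figure()
    dendrogram(Z)
    plt.title(f"{method} linkage")
plt.show()
```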
Distance between objects
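The slide's figure is not reproduced here. For reference, the two most common per-object distance measures are Euclidean and Manhattan distance; a small numeric check:

```python
import numpy as np

a = np.array([2.0, 3.0])
b = np.array([5.0, 7.0])

euclidean = np.sqrt(((a - b) ** 2).sum())  # sqrt(3^2 + 4^2) = 5.0
manhattan = np.abs(a - b).sum()            # 3 + 4 = 7.0
print(euclidean, manhattan)
```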
Centroid based: K-Means Clustering
• K-Means is probably the most widely used clustering technique
• It aims to partition the n observations into k clusters so as to minimize the within-cluster sum of squares (i.e. variance)
• Computationally less expensive than hierarchical techniques
• K, the number of clusters, must be specified in advance
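A minimal scikit-learn sketch, again assuming the scaled matrix `X` from earlier:

```python
from sklearn.cluster import KMeans

# k must be fixed up front; n_init reruns from several random
# initializations and keeps the lowest-WSS result.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)
print(km.cluster_centers_)  # one centroid per cluster
print(km.inertia_)          # within-cluster sum of squares (WSS)
```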
Lloyd’s algorithm
1. Assume K centroids
2. Compute the squared Euclidean distance of each object to these K centroids, and assign each object to the closest centroid, forming clusters
3. Compute the new centroid (mean) of each cluster based on the objects assigned to it
4. Repeat steps 2 and 3 until convergence, usually defined as the point at which no objects move between clusters
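A bare-bones NumPy sketch of the four steps above (empty clusters are not handled; a real implementation would reseed them):

```python
import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: assume K centroids (k distinct points drawn from X).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Step 2: squared Euclidean distance of every object to every
        # centroid; assign each object to its closest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Step 4: converged once no object changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its cluster.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```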
Choosing the optimal K
• Usually subjective, based on striking a good balance between compression and accuracy
• The “elbow” method is commonly used (see the sketch below)
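A sketch of the elbow plot with scikit-learn, using the WSS (exposed as `inertia_`):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
wss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
       for k in ks]
plt.plot(ks, wss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("within-cluster sum of squares")
plt.show()  # look for the "elbow" where the curve flattens
```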
K-Means
Strengths:
• Uses simple principles without the need for complex statistical terms
• Once clusters and their associated centroids are identified, it is easy to assign new objects (for example, new customers) to a cluster based on the object's distance from the closest centroid
• Because the method is unsupervised, using k-means helps to eliminate subjectivity from the analysis

Weaknesses:
• How to choose K?
• The k-means algorithm is sensitive to the starting positions of the initial centroids, so it is important to rerun the analysis several times for a given value of k to ensure the cluster results achieve the overall minimum WSS
• Susceptible to the curse of dimensionality
Visual Analysis of Clustering
• Visual analysis of the attributes selected for clustering may give an idea of the range of values over which K should be evaluated (see the pair-plot sketch below)
• Identifying the attributes on which clusters are clearly demarcated, and adding them in incremental order to build multi-dimensional clusters, is likely to give much better clusters than using all the attributes at one go
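One common way to do this visual pass is a pair plot of the attributes; a sketch with seaborn on the hypothetical `spend` frame:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots: visibly separated groups suggest which
# attributes demarcate clusters and a plausible range for K.
sns.pairplot(spend)
plt.show()
```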
Dynamic Clustering
• Clustering on the correct attributes is the key to good clustering results
• We can also consider attributes whose values change with time, e.g. age, income category, years of work experience
• We can use sequential k-means clustering over time to track individual clusters (how they change in size, shape, and position); a warm-start sketch follows this list
• We can also understand how individual data points move across clusters, form new clusters, etc.
• By analyzing changes in the clusters over time, using metrics such as cluster size, new entries, and exits, one can assess the impact of strategies designed from earlier clustering analysis
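A minimal warm-start sketch of sequential k-means, assuming `snapshots` is a hypothetical list of per-period feature matrices with identical columns:

```python
from sklearn.cluster import KMeans

prev_centroids = None
for t, X_t in enumerate(snapshots):
    if prev_centroids is None:
        km = KMeans(n_clusters=3, n_init=10, random_state=42)
    else:
        # Seed with last period's centroids so cluster identities
        # can be tracked as they drift over time.
        km = KMeans(n_clusters=3, init=prev_centroids, n_init=1)
    km.fit(X_t)
    prev_centroids = km.cluster_centers_
    print(t, km.cluster_centers_)  # watch size/position drift per period
```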