Unsupervised Machine Learning - Clustering
May 2020
SECRET
Knowledge Share – Session plan
Topic | Application | Schedule
Overview of Machine Learning and Feature Selection | Generic | 19-Feb
Regression - Supervised Machine Learning | Market Share Forecast / Inventory Obsolescence | 13-Mar
Classification - Supervised Machine Learning | Technician Attrition | 10-Apr
Clustering - Unsupervised Machine Learning | Dealer/Parts Clustering | 8-May
Bagging & Boosting - Ensemble Methods | Service Parts Forecasting | 5-Jun
Genetic Algorithm - Reinforcement Learning | Vehicle Route Optimization | 3-Jul
Linear Programming and Mathematical Optimization | Container Loading/Vanning | 31-Jul
Dimension Reduction & Pattern Search - Unsupervised Machine Learning | Generic | 28-Aug
Descriptive, Predictive & prescriptive Analytics
Machine Learning Universe
What is clustering?
• Clustering is the organization of unlabeled data into similarity groups
called clusters.
• A cluster is a collection of data items that are “similar”
to one another and “dissimilar” to data items in other clusters.
Historic application of clustering
Clustering techniques
Divisive
K-means
K-Means clustering
• K-means (MacQueen, 1967) is a partitional clustering algorithm
• Let the set of data points D be {x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the
number of dimensions.
• The k-means algorithm partitions the given data into
k clusters:
– Each cluster has a cluster center, called centroid.
– k is specified by the user
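The partitioning described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes the standard alternating scheme (assign each point to its nearest centroid, then move each centroid to the mean of its members), and the names `kmeans`, `squared_distance`, `mean_point`, and the sample `data` are hypothetical, introduced only for this example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(data, k, max_iter=100, seed=0):
    """Partition `data` into k clusters; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # k data points picked as initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in data
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return centroids, labels

# Hypothetical toy data: two well-separated groups in two dimensions (r = 2).
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(centroids)
print(labels)
```

The user supplies k, as the slide notes; the cluster numbering (which group is labelled 0) depends on the random initial centroids.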
K-means clustering example: step 1
K-means clustering example – step 2
K-means clustering example – step 3
K-means clustering example
K-means clustering example
K-means clustering example
Weaknesses of K-means
• The algorithm is only applicable if the mean is
defined.
– For categorical data, use k-modes, in which the centroid is
represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to the initial seeds.
• The algorithm is sensitive to outliers
– Outliers are data points that are very far away
from other data points.
– Outliers could be errors in the data recording or
some special data points with very different values.
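The outlier weakness follows directly from the centroid being a mean. The sketch below illustrates it on made-up numbers: the `cluster` points and the `outlier` are hypothetical, and `centroid` is a helper written only for this example.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Hypothetical tight cluster plus one point far away from the rest.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
outlier = (50.0, 50.0)

clean_centroid = centroid(cluster)
skewed_centroid = centroid(cluster + [outlier])

print(clean_centroid)   # (1.5, 1.5)
print(skewed_centroid)  # (11.2, 11.2)
```

A single distant point drags the centroid well outside the region where the other four points lie, which in turn can change which points are assigned to the cluster.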
Optimal number of clusters
Within-Cluster Sum of Squares (WCSS)
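WCSS is the sum, over all clusters, of the squared distances from each point to its cluster centroid. One common way to use it for choosing k, assumed here because the slide's plot is not available, is to compute WCSS for a range of k values and look for the point where further increases in k yield little improvement. The sketch below is illustrative only: `kmeans`, `wcss`, `best_wcss`, and the toy `data` are hypothetical names and values, and each k is run from several random initializations, keeping the lowest WCSS.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, k, seed, max_iter=100):
    """Basic k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def wcss(data, centroids, labels):
    """Within-cluster sum of squares for a given clustering."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(data, labels))

def best_wcss(data, k, restarts=10):
    """Lowest WCSS found over several random initializations."""
    return min(wcss(data, *kmeans(data, k, seed)) for seed in range(restarts))

# Hypothetical toy data with three visible groups.
data = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
    (8.0, 8.0), (8.0, 9.0), (9.0, 8.0),
    (1.0, 8.0), (2.0, 9.0), (1.0, 9.0),
]

wcss_by_k = {k: best_wcss(data, k) for k in range(1, 7)}
for k, value in wcss_by_k.items():
    print(k, round(value, 3))
```

WCSS generally keeps shrinking as k grows, so the smallest value is not the goal; the useful signal is where the curve flattens.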
Sensitivity to initial seeds
[Figure: two runs, each starting from a different random selection of seeds (centroids), shown at iteration 1 and iteration 2]
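The effect of the initial seeds can be reproduced on a tiny example. The sketch below is an assumption-laden illustration, not the data behind the slide's figure: it uses four hypothetical points at the corners of a wide rectangle and a hypothetical helper `kmeans_from` that takes explicit initial centroids, so that both runs are deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_from(data, init_centroids, max_iter=100):
    """k-means started from explicit seeds; returns (centroids, labels)."""
    centroids = list(init_centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def wcss(data, centroids, labels):
    """Within-cluster sum of squares for a given clustering."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(data, labels))

# Four points at the corners of a 4 x 1 rectangle.
data = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Seeds A: one seed on each short side of the rectangle.
cent_a, labels_a = kmeans_from(data, [(0.0, 0.0), (4.0, 0.0)])
# Seeds B: both seeds on the same short side.
cent_b, labels_b = kmeans_from(data, [(0.0, 0.0), (0.0, 1.0)])

wcss_a = wcss(data, cent_a, labels_a)
wcss_b = wcss(data, cent_b, labels_b)
print(labels_a, wcss_a)  # [0, 0, 1, 1] 1.0
print(labels_b, wcss_b)  # [0, 1, 0, 1] 16.0
```

Both runs converge, but the second settles on a much worse partition (top versus bottom instead of left versus right). Running the algorithm from several different seeds and keeping the result with the lowest WCSS is a common mitigation.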
Outliers
K-means summary
• Despite its weaknesses, k-means is still the most
popular clustering algorithm due to its simplicity and
efficiency.
• No clear evidence that any other clustering
algorithm performs better in general
• Comparing different clustering algorithms is a
difficult task. No one knows the correct
clusters!