Machine Learning
Sunbeam Infotech www.sunbeaminfo.com
Clustering
Finding groups (clusters) in unlabelled data: an exploratory data analysis technique.
Overview
§ Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of unlabelled data
§ It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different
§ In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure, such as Euclidean-based distance or correlation-based distance
§ The decision of which similarity measure to use is application-specific
§ Unlike supervised learning, clustering is considered an unsupervised learning method, since we don't have the ground truth (an output/dependent variable Y) against which to compare the output of the clustering algorithm and evaluate its performance
Overview
§ A typical clustering workflow has four steps; a sketch in code follows below
① Preprocess data (data cleaning, scaling)
② Select similarity measure
③ Cluster
④ Analyze (convert the clusters to classes, visualize the clusters)
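A minimal sketch of this workflow in Python with scikit-learn. The file name, column names, and k=3 are illustrative assumptions, not part of the slides.

# Hypothetical four-step workflow with scikit-learn
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1. Preprocess: data cleaning and scaling
df = pd.read_csv("customers.csv").dropna()                 # cleaning (assumed file)
X = StandardScaler().fit_transform(df[["age", "spend"]])   # scaling (assumed columns)

# 2./3. Similarity measure and clustering: KMeans uses Euclidean distance
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# 4. Analyze: convert clusters to classes and inspect them
df["cluster"] = model.labels_
print(df.groupby("cluster")[["age", "spend"]].mean())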
Use Cases
§ Marketing
§ Discovering groups in customer databases, e.g. who makes long-distance calls, who earns more, or who spends more
§ Insurance
§ Identifying groups of insurance policy holders with a high claim rate
§ Land use
§ Identification of areas of similar land use in a GIS database (at city, state, or country level)
Types
§ Exclusive clustering
§ An item belongs exclusively to one cluster, not to several
§ E.g. K-Means clustering
(Figure: items split into clusters C1 and C2 by age < 50 vs. age ≥ 50)
Types
§ Overlapping clustering
§ An item can belong to multiple clusters
§ Its degree of association with each cluster is known
§ E.g. Fuzzy C-means clustering
(Figure: two overlapping clusters, C1 and C2)
Types
§ Hierarchical clustering
§ When two clusters have a parent-child relationship
§ It forms a tree-like structure called a dendrogram
§ E.g. Hierarchical clustering
(Figure: dendrogram; the root covers all items, internal nodes such as C1, C2, C3 are clusters made of smaller clusters, e.g. a cluster with two smaller clusters or with four small clusters, and each leaf I1-I8 is a cluster with one item)
KMeans
Overview
§ KMeans is an iterative algorithm (it performs its steps multiple times) that tries to partition the dataset into K distinct non-overlapping subgroups (clusters), where each data point belongs to only one group
§ It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible
§ It assigns data points to a cluster such that the sum of the squared distance between the data points and the cluster's centroid (arithmetic mean of all the data points that belong to that cluster) is at the minimum
§ The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster
§ The similarity measure here is distance
How does it work?
§ Specify the number of clusters K
§ Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids, without replacement
§ Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing:
§ Compute the sum of the squared distance between data points and all centroids
§ Assign each data point to the closest cluster (centroid)
§ Compute the centroid of each cluster by taking the average of all the data points that belong to it
§ A from-scratch sketch of these steps follows below
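A compact from-scratch sketch of these steps in Python with NumPy; the function and variable names are my own, and it assumes no cluster ever ends up empty.

import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # initialize: pick k distinct data points as centroids (no replacement)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # squared distance of every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)  # assign each point to its closest centroid
        # move each centroid to the mean of the points assigned to it
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # stop when the centroids settle
            break
        centroids = new
    return labels, centroids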
(Figure: scatter plot of data points assigned to two centroids, C1 and C2)
K-Means Clustering - Algorithm
§ Initialization
§ Randomly initialise two points, called the cluster centroids
§ Note that the initial centroids are not required to be points from the dataset; you may select centroids outside of the dataset as well
(Figure: dataset with two initial centroids, C1 and C2)
K-Means Clustering - Algorithm
§ Cluster Assignment
§ Compute the distance between each of the points and the centroids
§ Depending on the minimum distance from the centroids, divide the points into two clusters
(Figure: points split between the two centroids, C1 and C2)
K-Means Clustering - Algorithm
§ Move Centroid
§ Consider the older centroids as data points
§ Take the older centroids and iteratively reposition them for optimization
§ Optimization
§ Repeat the steps until the cluster centroids stop changing position
(Figure: centroids C1 and C2 repositioned toward the centres of their clusters)
K-Means Clustering - Algorithm
§ Convergence
§ Finally, the k-means clustering algorithm converges and divides the data points into clusters that are clearly visible
(Figure: the converged clusters)
K-Means Clustering - Example
§ Suppose we want to group the visitors to a website using just their age (one-dimensional space), as follows:
15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65
N = 19
K-Means Clustering - Example
§ Initial clusters (random centroids or average)
k = 2
c1 = 16
c2 = 22
K-Means Clustering - Example
age lizei-cil (n 181
=
§ Iteration I
Before:
-
W
c1 = 16
-
c2 = 22
-
After:
c1 = 15.33
-
c2 = 36.25
--
K-Means Clustering - Example
§ Iteration II
Before: c1 = 15.33, c2 = 36.25
After: c1 = 18.56, c2 = 45.9
K-Means Clustering - Example
§ Iteration III
Before: c1 = 18.56, c2 = 45.9
After: c1 = 19.50, c2 = 47.89
K-Means Clustering - Example
§ Iteration IV
Before: c1 = 19.50, c2 = 47.89
After: c1 = 19.50, c2 = 47.89
§ The centroids no longer change, so the algorithm has converged; the short script below reproduces the iterations
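A short script to reproduce the worked example. Ties are broken toward c2, which matches the numbers on these slides (note that age 19 is equidistant from the initial centroids 16 and 22).

ages = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]
c1, c2 = 16.0, 22.0
for it in range(1, 6):
    g1 = [x for x in ages if abs(x - c1) < abs(x - c2)]   # strictly closer to c1
    g2 = [x for x in ages if abs(x - c1) >= abs(x - c2)]  # ties go to c2
    n1, n2 = sum(g1) / len(g1), sum(g2) / len(g2)
    print(f"Iteration {it}: c1 = {n1:.2f}, c2 = {n2:.2f}")
    if (n1, n2) == (c1, c2):  # converged: centroids unchanged
        break
    c1, c2 = n1, n2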
K-Means Clustering
§ How to find the optimum number of clusters?
§ Elbow Method
§ Purpose Method (intuition-based)
Elbow Method
§ Total within-cluster variation: the sum of squared distances between every point and its cluster's centroid
§ Also known as Within Sum of Squares (WSS)
§ The sum of squared (Euclidean) distances between the items and the corresponding centroid, calculated over all clusters:
WSS = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
where C_j is the j-th cluster, \mu_j is its centroid, and x_i ranges over the points within the cluster
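A small sketch of this formula in NumPy (the function name is my own):

import numpy as np

def wss(X, labels, centroids):
    # sum, over all clusters, of the squared Euclidean distance
    # of each point to its own cluster's centroid
    return sum(((X[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centroids))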
Elbow Method
§ Draw a curve of WSS (within sum of squares) against the number of clusters
§ It is called the elbow method because the curve looks like a human arm, and the elbow point gives us the optimum number of clusters
(Figure: WSS vs. number of clusters, with the elbow marked)
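A sketch of the elbow curve for the ages from the worked example, using scikit-learn's inertia_ attribute (which is exactly WSS):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# the ages from the worked example, as an (n_samples, 1) feature matrix
X = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35,
              40, 41, 42, 43, 44, 60, 61, 65]).reshape(-1, 1)

ks = range(1, 8)
wss_values = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
              for k in ks]
plt.plot(ks, wss_values, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WSS")
plt.show()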
Purpose Method
§ Get different clusters based on a variety of purposes
§ Partition the data on different metrics and see how well it performs for that particular case
§ Example: choosing how many garment sizes to offer
§ K=3: If you want to provide only 3 sizes (S, M, L) so that prices are cheaper, you will divide the data set into 3 clusters
§ K=5: If you want to provide more comfort and variety to your customers with more sizes (XS, S, M, L, XL), then you will divide the data set into 5 clusters
Applications
§ K-means is very popular and used in a variety of applications such as market segmentation, document clustering, image segmentation and image compression, etc.
Disadvantages
§ It always tries to construct a nice spherical shape around the centroid. That means that the minute the clusters have complicated geometric shapes, k-means does a poor job of clustering the data
§ It doesn't let data points that are far away from each other share the same cluster, even when they obviously belong to the same cluster
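A quick sketch of the first limitation using scikit-learn's make_moons, whose two true clusters are crescent-shaped rather than spherical:

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
y_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# k-means cuts the moons with a straight centroid boundary,
# so its agreement with the true crescents stays well below 1.0
print(adjusted_rand_score(y_true, y_pred))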
Hierarchical Clustering
Hierarchical Clustering
§ Separating data into different groups based on some measure of similarity, e.g. distance
§ Types
§ Agglomerative
§ Divisive
(Figure: bottom-up linkage of items A-E into a dendrogram; the biggest cluster, at the root, contains all the items)
Hierarchical Clustering
§ Dendrogram
§ A diagram that shows the hierarchical relationship between objects
(Figure: dendrogram with a root at the top and each parent node above its children)
Agglomerative Clustering
§ Also called bottom-up clustering, as it builds clusters from the bottom up
§ Each data point starts in its own cluster
§ These clusters are then joined greedily, by taking the two most similar clusters together
Agglomerative Clustering
§ Start by assigning each item to a cluster
§ If you have N items, you now have N clusters, each containing just one item
§ Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less
§ Compute distances (similarities) between the new cluster and each of the old clusters
§ Repeat steps 2 and 3 until all items are clustered into a single root cluster of size N; a code sketch follows below
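A minimal sketch of these steps with SciPy (the sample points are made up):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[15], [16], [19], [22], [40], [43], [60], [65]])

# 'single' linkage merges the pair of clusters whose closest
# members are nearest; the linkage variants follow on the next slides
Z = linkage(X, method="single")
dendrogram(Z, labels=[str(int(v)) for v in X.ravel()])
plt.show()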
Agglomerative Clustering
§ Step 3 can be done in different ways; the choice determines how the clusters are merged
§ Single-linkage
§ Complete-linkage
§ Average-linkage
Agglomerative Clustering
§ Single linkage
§ Also known as nearest neighbour clustering
§ The distance between two groups is defined as the distance between their two closest members
§ It often yields clusters in which individuals are added sequentially to a single group
(Figure: the distance between two clusters measured between their closest members)
Agglomerative Clustering
§ Complete linkage
§ Also known as furthest neighbour clustering
§ The distance between two groups is defined as the distance between their two farthest-apart members
(Figure: the distance between two clusters measured between their farthest members)
Agglomerative Clustering
§ Average linkage
§ Also referred to as the unweighted pair-group method
§ The distance between two groups is defined as the average distance between each of their members
(Figure: the distance between two clusters as the average of the pairwise distances between their members)
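A sketch comparing the three linkage choices on the same made-up data with scikit-learn:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[15], [16], [19], [22], [40], [43], [60], [65]])

for link in ("single", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    print(link, labels)  # cluster assignment of each point under this linkage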
Divisive Clustering
§ Also called top-down clustering, as it splits clusters from the top down
§ All data points start in the same cluster (the root)
§ Then, using a parametric clustering algorithm like k-means, divide the cluster into multiple clusters
§ For each cluster, repeat the process to find sub-clusters until the desired number of clusters is found; a rough sketch follows below
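A rough sketch of the idea, repeatedly splitting the largest cluster in two with k-means (the function name and sample data are my own):

import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters, seed=0):
    clusters = [np.arange(len(X))]  # start with one root cluster of all points
    while len(clusters) < n_clusters:
        # split the largest cluster into two with k-means
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=seed).fit_predict(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

X = np.array([15, 16, 19, 22, 40, 43, 60, 65], dtype=float).reshape(-1, 1)
for c in divisive(X, 3):
    print(X[c].ravel())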