
Clustering

Machine Learning

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
Data we will work with

• Customer Spend Data


• AVG_Mthly_Spend: The average monthly amount spent by customer
• No_of_Visits: The number of times a customer visited in a month
• Item_Counts: counts of apparel, fruits and vegetables, and staple items purchased

• Can we cluster similar customers together?
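As a sketch, a toy version of such a dataset can be built with NumPy. The column names follow the slide; the values are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical sample: 6 customers with the attributes named above.
data = np.column_stack([
    rng.uniform(500, 5000, 6),   # AVG_Mthly_Spend
    rng.integers(1, 15, 6),      # No_of_Visits
    rng.integers(0, 20, 6),      # Apparel item count
    rng.integers(0, 30, 6),      # Fruits & Vegetables item count
    rng.integers(0, 25, 6),      # Staples item count
])
print(data.shape)  # (6, 5)
```

Note that the scales differ widely (spend in the thousands vs. small counts), so the features should be standardized before any distance-based clustering.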


Connectivity-Based: Hierarchical Clustering

• Hierarchical clustering techniques create clusters in a hierarchical, tree-like structure

• Any distance measure can be used as the measure of similarity

• The tree-like cluster output is called a dendrogram

• Techniques either start with individual objects and sequentially merge them (agglomerative), or start with one cluster of all objects and sequentially divide it (divisive)

Agglomerative

• Starts with each object as a cluster of one record

• Sequentially merges the two closest clusters, using distance as the measure of similarity

• How would we measure the distance between two clusters?

Distance between clusters

• Single linkage – minimum distance (nearest neighbor)

• Complete linkage – maximum distance (farthest neighbor)

• Average linkage – average of the distances between all pairs of points in the two clusters

• Centroid method – merge the clusters with the minimum distance between their centroids

• Ward’s method – merge the clusters for which the increase in within-cluster variance is smallest
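Assuming SciPy is available, the linkage criteria above map directly onto its hierarchical clustering API; a minimal sketch on toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated groups of points in 2-D (toy data).
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

# Each method implements one of the cluster-distance rules above.
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                    # (n-1) x 4 merge history
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)
```

With data this well separated, all five criteria recover the same two clusters; they differ on noisy, elongated, or chained data.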
Distance between objects
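The two most common point-to-point distance measures, Euclidean and Manhattan, can be computed directly; a minimal NumPy sketch:

```python
import numpy as np

a = np.array([2.0, 3.0])
b = np.array([5.0, 7.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(a - b))          # sum of absolute differences

print(euclidean)  # 5.0  (a 3-4-5 triangle)
print(manhattan)  # 7.0
```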

Centroid-Based: K-Means Clustering

• K-means is probably the most widely used clustering technique

• It aims to partition the n observations into K clusters so as to minimize the within-cluster sum of squares (i.e., variance)

• Computationally less expensive than hierarchical techniques

• K, the number of clusters, has to be pre-defined

Lloyd’s algorithm

1. Assume K centroids

2. Compute the squared Euclidean distance of each object from these K centroids; assign each object to the closest centroid, forming clusters

3. Compute the new centroid (mean) of each cluster from the objects assigned to it

4. Repeat steps 2 and 3 until convergence, usually defined as the point at which no objects move between clusters

Choosing the optimal K

• Usually subjective, based on striking a good balance between compression and accuracy

• The “elbow” method is commonly used: plot the within-cluster sum of squares against K and choose the K at which the curve bends
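A minimal sketch of the elbow method, assuming scikit-learn is installed; inertia_ is scikit-learn's name for the within-cluster sum of squares (WSS):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs, so the elbow should appear at K = 3.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [5, 5], [10, 0])])

wss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)  # within-cluster sum of squares

for k, w in zip(range(1, 7), wss):
    print(k, round(w, 1))
# WSS drops sharply up to K = 3, then flattens: the "elbow".
```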

K-Means

Strengths:

• Uses simple principles, without the need for complex statistical terms

• Once clusters and their associated centroids are identified, it is easy to assign new objects (for example, new customers) to a cluster based on the object's distance from the closest centroid

• Because the method is unsupervised, using k-means helps to eliminate subjectivity from the analysis

Weaknesses:

• How to choose K?

• The algorithm is sensitive to the starting positions of the initial centroids, so it is important to rerun the analysis several times for a given value of K to ensure the results achieve the overall minimum WSS

• Susceptible to the curse of dimensionality
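The initialization sensitivity can be sketched with scikit-learn (assumed installed): run k-means several times with single random initializations and keep the run with the lowest WSS, which is what scikit-learn's n_init parameter automates internally:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Four blobs in a row (toy data).
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(25, 2))
               for c in ([0, 0], [4, 0], [8, 0], [12, 0])])

# One random initialization per run: WSS can vary from run to run.
runs = [KMeans(n_clusters=4, init="random", n_init=1, random_state=s)
        .fit(X).inertia_ for s in range(10)]
best = min(runs)  # keep the run with the overall minimum WSS
print(round(best, 2), round(max(runs), 2))
```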
Visual Analysis of Clustering

• Visual analysis of the attributes selected for clustering may give an idea of the range of values over which K should be evaluated

• Identifying the attributes on which clusters are clearly demarcated, and using them in incremental order to build the multi-dimensional clusters, is likely to give much better clusters than using all the attributes at one go

Dynamic Clustering

• Clustering on the correct attributes is the key to good clustering results

• We can also consider attributes whose values change with time, e.g., age, income category, years of work experience

• We can use sequential k-means clustering over time to track individual clusters (how they change in size, shape, and position)

• We can also understand how individual data points move across clusters, form new clusters, etc.

• By analyzing changes in the clusters over time, using metrics such as cluster size, new entries, and exits, one can assess the impact of strategies designed from earlier clustering analyses
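One way to sketch sequential (online) k-means, where each arriving point updates only its nearest centroid via an incremental mean; the function name and data here are illustrative:

```python
import numpy as np

def sequential_kmeans(stream, centroids):
    """Online k-means sketch: update the nearest centroid for each new point."""
    centroids = np.array(centroids, dtype=float)
    counts = np.zeros(len(centroids))
    for x in stream:
        j = np.argmin(((centroids - x) ** 2).sum(axis=1))  # nearest centroid
        counts[j] += 1
        # Incremental mean: the centroid moves toward x by 1/count of its cluster.
        centroids[j] += (x - centroids[j]) / counts[j]
    return centroids

stream = np.array([[0.9, 1.1], [1.2, 0.8], [8.1, 7.9], [7.8, 8.2], [1.0, 1.0]])
final = sequential_kmeans(stream, centroids=[[0.0, 0.0], [10.0, 10.0]])
print(final)  # each centroid has drifted toward the points assigned to it
```

Tracking the counts and centroid positions over time is what allows the changes in cluster size and position described above to be monitored.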

