Python for Data Science
Machine Learning in Python:
Clustering
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you should be able to:
§ Articulate the goal of cluster analysis
§ Discuss whether cluster analysis is supervised or unsupervised
§ List some ways that cluster results can be applied
Cluster Analysis Overview
Goal: Organize similar items into groups
Cluster Analysis Examples
• Segment customer base into groups
• Characterize different weather patterns for a region
• Group news articles into topics
• Discover crime hot spots
Cluster Analysis
• Divides data into clusters
• Similar items are placed in same cluster
Intra-cluster differences are minimized
Inter-cluster differences are maximized
Similarity Measures
• Euclidean Distance
• Manhattan Distance
• Cosine Similarity
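The three measures named on this slide can be sketched in plain Python. This is a minimal illustration, not course code: the function names and the points A and B are assumptions chosen for the example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (distance along grid lines)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, treated as vectors from the origin."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical points, chosen only for illustration.
A = (1.0, 2.0)
B = (4.0, 6.0)
print(euclidean_distance(A, B))  # 5.0
print(manhattan_distance(A, B))  # 7.0
print(cosine_similarity(A, B))
```

Note that the two distances grow as items become less similar, while cosine similarity grows as items become more similar.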
Normalizing Input Variables
[Figure: Weight and Height shown as scaled values]
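One common way to put variables such as weight and height on a comparable scale is to standardize each one to zero mean and unit standard deviation. The slide does not say which scaling it uses, so the choice of standardization, the function name, and the sample values below are all assumptions.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical measurements on very different scales.
weights = [50.0, 60.0, 70.0, 80.0, 90.0]
heights = [1.50, 1.60, 1.70, 1.80, 1.90]

scaled_weights = standardize(weights)
scaled_heights = standardize(heights)
print(scaled_weights)
print(scaled_heights)
```

Without scaling, the variable with the larger numeric range (here, weight) would dominate any distance calculation.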
Cluster Analysis Notes
• Unsupervised
• There is no ‘correct’ clustering
• Clusters don’t come with labels
• Interpretation and analysis required to make sense of clustering results!
Uses of Cluster Results
• Data segmentation
• Analysis of each segment can provide insights
[Figure: example segments labeled science fiction, non-fiction, children’s]
Uses of Cluster Results
• Categories for classifying new data
• New sample assigned to closest cluster
Label of closest cluster used to classify new sample
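Assigning a new sample to the closest cluster can be sketched as a nearest-centroid lookup. The centroid coordinates, the two-feature layout, and the `classify` helper are hypothetical; the labels reuse the segment names from the earlier slide.

```python
import math

# Hypothetical centroids, keyed by a label chosen after inspecting each cluster.
centroids = {
    "science fiction": (8.0, 1.0),
    "non-fiction": (1.0, 9.0),
    "children's": (2.0, 2.0),
}

def classify(sample, centroids):
    """Return the label of the centroid closest to sample (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))

print(classify((7.0, 2.0), centroids))  # science fiction
```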
Uses of Cluster Results
• Labeled data for classification
• Cluster samples used as labeled data
Labeled samples for science fiction customers
Uses of Cluster Results
• Basis for anomaly detection
• Cluster outliers are anomalies
Anomalies that require further analysis
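One simple way to flag cluster outliers is to mark any sample that lies farther than some cutoff from its nearest centroid. The slide does not define how outliers are identified, so the distance cutoff, the `find_anomalies` helper, and the data below are assumptions made for illustration.

```python
import math

def find_anomalies(samples, centroids, threshold):
    """Return samples whose distance to the nearest centroid exceeds threshold."""
    anomalies = []
    for sample in samples:
        nearest = min(math.dist(sample, c) for c in centroids)
        if nearest > threshold:
            anomalies.append(sample)
    return anomalies

# Hypothetical centroids and samples; (5.0, 5.0) sits far from both clusters.
centroids = [(0.0, 0.0), (10.0, 10.0)]
samples = [(0.5, -0.5), (9.0, 10.5), (5.0, 5.0), (10.2, 9.9)]
print(find_anomalies(samples, centroids, threshold=3.0))  # [(5.0, 5.0)]
```

The flagged samples are candidates for further analysis, not confirmed anomalies.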
Cluster Analysis Summary
• Organize similar items into groups
• Analyzing clusters often leads to useful insights about data
• Clusters require analysis and interpretation
Python for Data Science
Machine Learning in Python:
k-Means Clustering
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you should be able to:
§ Describe the steps in the k-means algorithm
§ Explain what the ‘k’ stands for in k-means
§ Define cluster centroid
Cluster Analysis
• Divides data into clusters
• Similar items are in same cluster
Intra-cluster differences are minimized
Inter-cluster differences are maximized
k-Means Algorithm
Select k initial centroids (cluster centers)
Repeat
    Assign each sample to closest centroid
    Calculate mean of cluster to determine new centroid
Until some stopping criterion is reached
k-Means
[Figure: k-means iterations, with centroids marked X: (a) Original samples, (b) Initial centroids, (c) Assign samples, (d) Re-calculate centroids, (e) Assign samples, (f) Re-calculate centroids]
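The select / assign / re-calculate loop can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the course's implementation: initial centroids are drawn at random from the samples, distance is Euclidean, the loop stops when centroids no longer change (or after `max_iter` passes), and all names and data are hypothetical.

```python
import math
import random

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def assign(samples, centroids):
    """For each sample, return the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(s, centroids[j]))
        for s in samples
    ]

def k_means(samples, k, max_iter=100, seed=None):
    rng = random.Random(seed)
    # Select k initial centroids (assumption: k distinct random samples).
    centroids = rng.sample(samples, k)
    for _ in range(max_iter):
        # Assign each sample to its closest centroid.
        labels = assign(samples, centroids)
        # Re-calculate each centroid as the mean of its cluster.
        new_centroids = []
        for j in range(k):
            members = [s for s, label in zip(samples, labels) if label == j]
            # An empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Stopping criterion: no changes to centroids.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(samples, centroids)

# Hypothetical data with two well-separated groups.
samples = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(samples, k=2, seed=0)
print(centroids)
print(labels)
```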
Choosing Initial Centroids
Issue:
Final clusters are sensitive to initial centroids
Solution:
Run k-means multiple times with different random initial centroids, and choose best results
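A sketch of the restart strategy, assuming "best" means the run with the lowest total squared distance between samples and their closest centroids (the slide does not say how runs are compared). The compact `k_means` below, the helper names, and the data are hypothetical and exist only to make the example self-contained.

```python
import math
import random

def k_means(samples, k, rng, max_iter=100):
    """Compact k-means; returns the final centroids."""
    centroids = rng.sample(samples, k)  # random initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k), key=lambda j: math.dist(s, centroids[j]))
            clusters[nearest].append(s)
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def total_squared_error(samples, centroids):
    """Sum of squared distances from each sample to its closest centroid."""
    return sum(min(math.dist(s, c) for c in centroids) ** 2 for s in samples)

def best_of_restarts(samples, k, n_runs=10, seed=0):
    """Run k-means n_runs times from different random starts; keep the best run."""
    rng = random.Random(seed)
    runs = [k_means(samples, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda centroids: total_squared_error(samples, centroids))

# Hypothetical data with three loose groups.
samples = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.3),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.3),
    (9.0, 1.0), (9.1, 1.2), (8.8, 0.9),
]
best = best_of_restarts(samples, k=3)
print(best, total_squared_error(samples, best))
```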
Evaluating Cluster Results
error = distance between sample & centroid
squared error = error²
Sum of squared errors between all samples & centroid, for each cluster
Sum over all clusters = WSSE (Within-Cluster Sum of Squared Error)
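The WSSE calculation can be sketched directly from these definitions. The function signature and the tiny dataset are assumptions for illustration; `labels[i]` is taken to be the index of the centroid that sample `i` is assigned to.

```python
import math

def wsse(samples, labels, centroids):
    """Within-Cluster Sum of Squared Error over all clusters."""
    total = 0.0
    for sample, label in zip(samples, labels):
        error = math.dist(sample, centroids[label])  # distance to own centroid
        total += error ** 2                          # squared error
    return total

# Hypothetical clustering: two clusters of two samples each.
samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 2.0)]
print(wsse(samples, labels, centroids))  # 10.0
```

Here the per-sample errors are 1, 1, 2, and 2, so the squared errors sum to 1 + 1 + 4 + 4 = 10.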
Using WSSE
WSSE1 < WSSE2 → WSSE1 is better numerically
Caveats:
• Does not mean that cluster set 1 is more ‘correct’ than cluster set 2
• Larger values for k will always reduce WSSE
Choosing Value for k
• Approaches:
• Visualization
• Application-Dependent
• Data-Driven
Elbow Method for Choosing k
“Elbow” suggests value for k should be 3
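The elbow is usually read off a plot of WSSE against k by eye. As a rough sketch, it can also be approximated by picking the k where the drop in WSSE slows down the most. This heuristic, the `elbow_k` helper, and the WSSE values below are all assumptions for illustration and are not taken from the slides.

```python
def elbow_k(ks, wsse_values):
    """Return the k where the decrease in WSSE slows down the most."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        drop_before = wsse_values[i - 1] - wsse_values[i]
        drop_after = wsse_values[i] - wsse_values[i + 1]
        bend = drop_before - drop_after  # large when the curve flattens after ks[i]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Hypothetical WSSE values for k = 1..6 (made up for this example).
ks = [1, 2, 3, 4, 5, 6]
wsse_values = [1000.0, 520.0, 150.0, 120.0, 100.0, 90.0]
print(elbow_k(ks, wsse_values))  # 3
```

Because larger k keeps lowering WSSE, the goal is the point of diminishing returns, not the minimum.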
Stopping Criteria
When to stop iterating?
• No changes to centroids
• Number of samples changing clusters is below threshold
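Both criteria can be checked between consecutive iterations. The sketch below assumes centroids are stored as lists of tuples and cluster assignments as lists of integer labels; the `should_stop` helper and the threshold values are hypothetical.

```python
def should_stop(old_centroids, new_centroids, old_labels, new_labels, threshold):
    """Return True when either stopping criterion is met."""
    # Criterion 1: no changes to centroids.
    if new_centroids == old_centroids:
        return True
    # Criterion 2: number of samples changing clusters is below threshold.
    changed = sum(1 for old, new in zip(old_labels, new_labels) if old != new)
    return changed < threshold

# One sample switched clusters and the centroids moved slightly.
old_c, new_c = [(0.0, 0.0), (5.0, 5.0)], [(0.1, 0.0), (5.0, 5.0)]
old_l, new_l = [0, 0, 1, 1], [0, 0, 1, 0]
print(should_stop(old_c, new_c, old_l, new_l, threshold=2))  # True
print(should_stop(old_c, new_c, old_l, new_l, threshold=1))  # False
```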
Interpreting Results
• Examine cluster centroids
• How are clusters different?
Compare centroids to see how clusters are different
K-Means Summary
• Classic algorithm for cluster analysis
• Simple to understand and implement and is efficient
• Value of k must be specified
• Final clusters are sensitive to initial centroids