Copy of 09-clustering.ipynb
Open in Colab
After clicking the "Open in Colab" link, copy the notebook to your own Google Drive before getting started, or your work will not be saved.
One of the most straightforward tasks we can perform on an unlabeled dataset is to find groups of points that are similar to one another -- what we call clusters.
K-Means is one of the most popular "clustering" algorithms. K-Means stores k centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid.
K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
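To make these two alternating steps concrete, here is a minimal NumPy sketch of a few iterations of this procedure (Lloyd's algorithm). The toy 2-D data, k = 3, and the fixed iteration count are assumptions made purely for illustration; the rest of this notebook uses scikit-learn's KMeans instead.

import numpy as np

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 2))                             # toy 2-D data (illustration only)
k = 3
centroids = X_toy[rng.choice(len(X_toy), k, replace=False)]   # random initial centroids

for _ in range(10):                                           # a few Lloyd iterations
    # (1) assignment step: each point joins the cluster of its nearest centroid
    dists = np.linalg.norm(X_toy[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # (2) update step: each centroid moves to the mean of its assigned points
    centroids = np.array([
        X_toy[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])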
Dataset: MNIST
MNIST Dataset: A Brief Overview
Introduction
The MNIST (Modified National Institute of Standards and Technology) dataset is one of the most widely used datasets in machine learning,
particularly for benchmarking image classification algorithms.
Description
Type: Handwritten digit dataset
Size: 70,000 grayscale images (28x28 pixels)
Training set: 60,000 images
Test set: 10,000 images
Classes: 10 (Digits 0-9)
Format: Each image is a 28×28 matrix with pixel values ranging from 0 (black) to 255 (white).
Why MNIST?
Simple yet challenging enough for ML research.
Well-structured and preprocessed, eliminating the need for extensive data cleaning.
Accessing MNIST
Available in popular ML libraries:
TensorFlow/Keras: tf.keras.datasets.mnist
PyTorch: torchvision.datasets.MNIST
Scikit-learn: fetch_openml('mnist_784')
from sklearn.datasets import fetch_openml
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.metrics import adjusted_rand_score
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors
# Load MNIST dataset
mnist = fetch_openml('mnist_784', version=1, parser='auto')
X, y = mnist['data'], mnist['target']
np.random.seed(42)
The MNIST digits in this set are flattened arrays of 784 pixels. We can reshape them to 28x28 pixels and plot them using matplotlib.
image = X.iloc[0, :].values.reshape((28, 28))
plt.imshow(image)
<matplotlib.image.AxesImage at 0x7e3a7b6a3610>
1. Preparing the Data
a. Using the full dataset, normalize each column so that the minimum column value is 0, and the maximum is 1. (Hint: if your normalization
process leads to missing values, replace these with 0!)
b. Get a random sample of 10% of the data. The full dataset may take a while to run some of the below methods.
# a.
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)  # MinMaxScaler handles constant (all-zero) columns without producing NaNs
# b.
sample_size = int(0.1 * X_normalized.shape[0])
sample_idx = np.random.choice(X_normalized.shape[0], sample_size, replace=False)  # one index set so X and y stay aligned
X_sample = X_normalized[sample_idx, :]
y_sample = y.iloc[sample_idx]
2. K-Means Clustering
a. Using the MNIST subset, determine the optimal k value for k-means according to the silhouette score (a quick refresher on the silhouette score follows this list). Use a range of k-values from 2 - 12.
b. Fit a k-means model with the optimal k value.
c. Using a dimensionality reduction method, t-SNE, generate a two-dimensional representation (called an embedding) of the MNIST subset. (See an example here. Use default values, but set random_state = 42.)
d. Create two side-by-side scatterplots of the t-SNE representation using plt.subplots. Color the first figure according to the true labels and the second according to the k-means cluster labels. Be sure to include proper figure titles and a legend.
e. Describe the fit. Does this align with your expectation, given the silhouette score?
f. Fit k-means with 10 clusters and calculate the silhouette score. How does it compare with the score for the optimal k? (If 10 is optimal, just say so.)
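As a quick refresher on the metric used throughout this section: for a point $i$, let $a(i)$ be its mean distance to the other points in its own cluster and $b(i)$ its mean distance to the points in the nearest other cluster. Its silhouette is

$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))},$$

and silhouette_score reports the average of $s(i)$ over all points, ranging from -1 to 1 (higher means better-separated clusters).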
# a.
sil_scores = []
for k in range(2, 13):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_sample)
    sil_score = silhouette_score(X_sample, kmeans.labels_)
    sil_scores.append(sil_score)
optimal_k = np.argmax(sil_scores) + 2
print(f"Optimal k based on silhouette score: {optimal_k}")
Optimal k based on silhouette score: 2
# b.
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
kmeans.fit(X_sample)
kmeans_labels = kmeans.labels_
# c.
tsne = TSNE(random_state=42)
X_tsne = tsne.fit_transform(X_sample)
# d.
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
scatter_true = axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample.astype(int), cmap='tab10')
axes[0].set_title('True Labels')
# build the legend from the scatter's color mapping (a bare legend() call has no labeled artists)
axes[0].legend(*scatter_true.legend_elements(), title='Labels', loc='upper right')
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=kmeans_labels, cmap='tab10')
axes[1].set_title('K-Means Clusters')
plt.show()
Part e comments here: With the optimal k = 2, the K-Means fit only partially aligns with the true labels: some digits, such as 0 and 1, separate fairly cleanly, while visually similar digits such as 4 and 9 end up mixed together. This is consistent with the low silhouette score, which indicates only weak cluster separation, so there is clear room for improvement.
# f.
kmeans_10 = KMeans(n_clusters=10, random_state=42)
kmeans_10.fit(X_sample)
kmeans_10_labels = kmeans_10.labels_
sil_score_10 = silhouette_score(X_sample, kmeans_10_labels)
print(f"Silhouette score for 10 clusters: {sil_score_10}")
Silhouette score for 10 clusters: 0.05929741767153872
f. comments here: The silhouette score with 10 clusters (about 0.059) does not improve on the score at the optimal k = 2, so forcing one cluster per digit actually reduces the measured separation; in raw pixel space the ten digit classes do not form ten compact, well-separated groups.
3. Hierarchical Clustering
Here you will be performing hierarchical clustering on the same data subset.
Here you will be comparing linkage methods and metrics. Use the following linkage/metric combinations:
Linkage Method | Metrics
Ward | Euclidean
Single | Euclidean, Cosine, Manhattan
Complete | Euclidean, Cosine, Manhattan
Average | Euclidean, Cosine, Manhattan
a. Fit hierarchical clustering with 10 clusters for each combination and store the silhouette scores.
b. According to the silhouette scores, which combination is optimal? Fit an HC model with this combination and store the cluster labels
(make these a different variable name from that used for your k-means labels; we will eventually be comparing all clustering models).
c. As in part 2. d., plot side-by-side scatterplots using the same t-SNE embedding.
d. Describe the fit. Does this align with your expectation, given the silhouette score?
# a.
linkage_methods = ['ward', 'single', 'complete', 'average']
metrics = ['euclidean', 'manhattan', 'cosine']

sil_scores_hc = {}
for linkage in linkage_methods:
    for metric in metrics:
        # Ward linkage only supports Euclidean distances
        if linkage == 'ward' and metric != 'euclidean':
            continue
        # recent scikit-learn versions call this argument `metric` (formerly `affinity`)
        hc = AgglomerativeClustering(n_clusters=10, linkage=linkage, metric=metric)
        hc_labels = hc.fit_predict(X_sample)
        sil_score_hc = silhouette_score(X_sample, hc_labels)
        sil_scores_hc[(linkage, metric)] = sil_score_hc
        print(f"Linkage: {linkage}, Metric: {metric}, Silhouette Score: {sil_score_hc}")

best_combination = max(sil_scores_hc, key=sil_scores_hc.get)
print(f"Best linkage and metric combination: {best_combination} with silhouette score: {sil_scores_hc[best_combination]}")
Linkage: ward, Metric: euclidean, Silhouette Score: 0.04413657626747381
# b.
best_linkage, best_metric = best_combination
hc_best = AgglomerativeClustering(n_clusters=10, linkage=best_linkage, metric=best_metric)
hc_labels = hc_best.fit_predict(X_sample)
# c.
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
scatter_true = axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample.astype(int), cmap='tab10')
axes[0].set_title('True Labels')
axes[0].legend(*scatter_true.legend_elements(), title='Labels', loc='upper right')
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=hc_labels, cmap='tab10')
axes[1].set_title('Hierarchical Clusters')
plt.show()
Part d comments here:
4. DBSCAN Clustering
DBSCAN can be a little tricky to tune due to its two hyperparameters, eps and min_samples. The eps parameter is the maximum distance between two samples for one to be considered as in the neighborhood of the other. The min_samples parameter is the number of samples in a neighborhood for a point to be considered a core point. Here, we will attempt to find optimal hyperparameter combinations, focusing primarily on eps.
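To make the two parameters concrete, the small sketch below checks whether a single point would qualify as a core point for a given eps and min_samples. The toy data and the particular eps/min_samples values are assumptions for illustration only, not the values used in the assignment.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))   # toy 2-D data (illustration only)
eps, min_samples = 0.5, 5           # assumed example values

# Find all points within eps of point 0. DBSCAN counts the point itself, so point 0
# is a core point if this neighborhood contains at least min_samples members.
nbrs = NearestNeighbors(radius=eps).fit(X_toy)
neighborhood = nbrs.radius_neighbors(X_toy[:1], return_distance=False)[0]
print(f"Point 0 has {len(neighborhood)} neighbors within eps={eps}; "
      f"core point: {len(neighborhood) >= min_samples}")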
a. Setting min_samples. A rule of thumb is to set min_samples to the number of dimensions (columns) in the dataset plus one. Here, you will try two different values:
1. Follow the rule of thumb.
2. Set min_samples to the number of observations in the sample divided by 10 (roughly equal numbers per digit label).
For each min_samples value k (both cases):
i. Calculate the distance between every data point and its kth closest neighbor.
ii. Compute the average kth nearest neighbor distances (for both min_samples values).
iii. Compute the standard deviation of the kth nearest neighbor distances (for both min_samples values).
iv. Generate a set of 10 eps values equally spaced between mean ± standard deviation (for each min_samples value).
b. Using the sets of min_samples and eps values, apply DBSCAN and record all silhouette scores.
c. Fit DBSCAN using the best hyperparameters based on the silhouette scores.
d. Plot the t-SNE embedded values using subplots:
One colored by true labels.
Another colored by cluster labels.
e. Describe the clustering results and comment on your findings.
# a.
min_samples1 = X_sample.shape[1] + 1
min_samples2 = X_sample.shape[0] // 10
def calc_kth_distances(X, k):
    # distance from each point to its kth nearest neighbor
    # (kneighbors counts the point itself as a neighbor, so this is the distance to its (k-1)th other point)
    nbrs = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nbrs.kneighbors(X)
    kth_distances = distances[:, -1]
    return kth_distances
kth_distances1 = calc_kth_distances(X_sample, min_samples1)
kth_distances2 = calc_kth_distances(X_sample, min_samples2)
mean_dist1, std_dist1 = np.mean(kth_distances1), np.std(kth_distances1)
mean_dist2, std_dist2 = np.mean(kth_distances2), np.std(kth_distances2)
print(f"Mean distance for min_samples = {min_samples1}: {mean_dist1}, Std: {std_dist1}")
print(f"Mean distance for min_samples = {min_samples2}: {mean_dist2}, Std: {std_dist2}")
Mean distance for min_samples = 785: 8.790720300862194, Std: 0.9875858071274719
Mean distance for min_samples = 700: 8.699563647113994, Std: 1.0056822617735457
# a. (iv): generate candidate eps values between mean ± standard deviation
eps_values1 = np.linspace(mean_dist1 - std_dist1, mean_dist1 + std_dist1, 10)
eps_values2 = np.linspace(mean_dist2 - std_dist2, mean_dist2 + std_dist2, 10)
print(f"Generated eps values for min_samples = {min_samples1}: {eps_values1}")
print(f"Generated eps values for min_samples = {min_samples2}: {eps_values2}")
Generated eps values for min_samples = 785: [7.80313449 8.02259801 8.24206152 8.46152503 8.68098854 8.90045206
9.11991557 9.33937908 9.5588426 9.77830611]
Generated eps values for min_samples = 700: [7.69388139 7.91736633 8.14085128 8.36433623 8.58782117 8.81130612
9.03479107 9.25827601 9.48176096 9.70524591]
# b.
sil_scores_dbscan = []
for min_samples_val, eps_vals in [(min_samples1, eps_values1), (min_samples2, eps_values2)]:
    for eps in eps_vals:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples_val)
        dbscan_labels = dbscan.fit_predict(X_sample)
        # the silhouette score is only defined when there are at least two distinct labels
        if len(set(dbscan_labels)) > 1:
            sil_score_dbscan = silhouette_score(X_sample, dbscan_labels)
            sil_scores_dbscan.append((min_samples_val, eps, sil_score_dbscan))
best_dbscan_combination = max(sil_scores_dbscan, key=lambda x: x[2])
print(f"Best DBSCAN combination: min_samples={best_dbscan_combination[0]}, eps={best_dbscan_combination[1]} with silhouette score: {best_dbscan_combination[2]}")
Best DBSCAN combination: min_samples=785, eps=9.778306107989666 with silhouette score: 0.2484688014770847
# c. and d.
dbscan_best = DBSCAN(eps=best_dbscan_combination[1], min_samples=best_dbscan_combination[0])
dbscan_labels = dbscan_best.fit_predict(X_sample)
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample.astype(int), cmap='tab10')
axes[0].set_title('True Labels')
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=dbscan_labels, cmap='tab10')
axes[1].set_title('DBSCAN Clusters')
plt.show()
Part e comments here: The DBSCAN clustering results show that the algorithm struggles to define distinct clusters, probably due to noise in the data and the difficulty of choosing optimal values for eps and min_samples. This can lead to a poorer fit compared to the other clustering methods.
5. Adjusted Rand Index (ARI) in Clustering
The Adjusted Rand Index (ARI) is a metric used to evaluate the similarity between two clusterings, accounting for chance. It measures how
well a clustering algorithm’s results match a known ground truth or reference clustering.
Intuition
Suppose you have a set of data points, and you classify them into groups (clusters).
ARI compares your clustering to a "correct" classification and checks how often pairs of points are correctly grouped together or
correctly separated in both cases.
A simple Rand Index gives a raw similarity score, but ARI adjusts for randomness (see the formula just below), ensuring that random assignments don't get an artificially high score.
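Concretely, the adjustment rescales the raw Rand Index (RI) by its expected value under random label assignments:

$$\text{ARI} = \frac{\text{RI} - \mathbb{E}[\text{RI}]}{\max(\text{RI}) - \mathbb{E}[\text{RI}]},$$

so a random clustering scores about 0 while perfect agreement scores 1.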
How It's Used
Benchmarking Clustering Algorithms: ARI is useful for comparing different clustering methods to a gold-standard classification.
No Bias Toward Number of Clusters: Unlike some other metrics, ARI corrects for the number of clusters, making it more reliable when
clusters differ in size or number.
Interpreting Results
Range: ARI ranges from -1 to 1.
1 = Perfect agreement with ground truth
0 = Random clustering
Negative values = Worse than random (unlikely in practice)
Application: Used in cases like image segmentation, document clustering, or biological data grouping, where a reference classification
exists.
Python Implementation
Import via sklearn.metrics (the function adjusted_rand_score )
Inputs: True Labels, Predicted Labels (E.g., cluster labels)
Output: The Adjusted Rand Index Value
Questions:
a. For each clustering method (k-means, hierarchical clustering, DBSCAN), calculate the adjusted rand index using the models with optimal
hyperparameters.
b. Plot all four scatterplots using subplots (1 row, 4 columns). The first colored by the true values, and the rest colored by your cluster labels
(k-means, HC, DBSCAN).
c. Do the scores seem reflective of the assigned cluster labels, according to the plots? Please explain.
# a.
ari_kmeans = adjusted_rand_score(y_sample, kmeans_labels)
ari_hc = adjusted_rand_score(y_sample, hc_labels)
ari_dbscan = adjusted_rand_score(y_sample, dbscan_labels)
print(f"ARI for K-Means: {ari_kmeans}")
print(f"ARI for Hierarchical Clustering: {ari_hc}")
print(f"ARI for DBSCAN: {ari_dbscan}")
ARI for K-Means: -0.00015513136315309682
ARI for Hierarchical Clustering: -0.00025534244962585405
ARI for DBSCAN: 2.491890088874842e-07
# b.
fig, axes = plt.subplots(1, 4, figsize=(20, 6))
#True labels plot
axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample.astype(int), cmap='tab10')
axes[0].set_title('True Labels')
#K-Means labels plot
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=kmeans_labels, cmap='tab10')
axes[1].set_title('K-Means Clusters')
#Hierarchical Clustering labels plot
axes[2].scatter(X_tsne[:, 0], X_tsne[:, 1], c=hc_labels, cmap='tab10')
axes[2].set_title('Hierarchical Clusters')
#DBSCAN labels plot
axes[3].scatter(X_tsne[:, 0], X_tsne[:, 1], c=dbscan_labels, cmap='tab10')
axes[3].set_title('DBSCAN Clusters')
plt.show()
Part c comments here: All three ARI values are essentially zero (K-Means and hierarchical clustering are slightly negative, DBSCAN barely positive), so by this metric none of the cluster labelings agrees with the true digit labels beyond what random assignment would achieve. Whether that seems reflective depends on the plots: near-zero scores are only consistent with cluster colorings that show little correspondence to the true-label groupings.
6. Visualizing Clustered Digits
Pick one of the cluster labels from one of the clustering methods (your choice). Select a random subset of 20 points belonging to that cluster.
a. Using subplots (5 rows, 4 columns), plot the 20 digit images.
b. Comment on the clustered points (e.g., do they look similar? Are they representing the same digit?).
# a.
cluster_label = 0
indices = np.where(kmeans_labels == cluster_label)[0]
random_indices = np.random.choice(indices, 20, replace=False)
fig, axes = plt.subplots(5, 4, figsize=(10, 12))
for i, idx in enumerate(random_indices):
    ax = axes[i // 4, i % 4]
    image = X_sample[idx].reshape(28, 28)
    ax.imshow(image, cmap='gray')
    ax.axis('off')
plt.show()
Part b comments here: The selected cluster contains digits that appear visually similar but they are not all the same digit. This suggests that
the clustering might group different digit classes together. K-means does not guarantee perfect grouping for digits that share similar pixel
patterns.