
09-clustering.ipynb - Colab

Open in Colab

After clicking the "Open in Colab" link, copy the notebook to your own Google Drive before getting started; otherwise your work will not be saved.

One of the most straightforward tasks we can perform on a data set without labels is to find groups of data in our dataset which are similar
to one another -- what we call clusters.

K-Means is one of the most popular "clustering" algorithms. K-means stores k centroids that it uses to define clusters. A point is considered
to be in a particular cluster if it is closer to that cluster's centroid than any other centroid.

K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
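To make those two alternating steps concrete, here is a minimal NumPy sketch (purely illustrative: the function name, the random initialization, and the convergence check are our own choices; the scikit-learn KMeans used later in this notebook is what you should use in practice).

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    # Illustrative k-means loop: assign points, then recompute centroids.
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # (1) Assign each point to the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (2) Recompute each centroid as the mean of its assigned points.
        # (This sketch ignores the empty-cluster edge case.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # Assignments have stabilized.
        centroids = new_centroids
    return labels, centroids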

Dataset: MNIST

MNIST Dataset: A Brief Overview
Introduction
The MNIST (Modified National Institute of Standards and Technology) dataset is one of the most widely used datasets in machine learning,
particularly for benchmarking image classification algorithms.

Description
Type: Handwritten digit dataset
Size: 70,000 grayscale images (28x28 pixels)

Training set: 60,000 images


Test set: 10,000 images

Classes: 10 (Digits 0-9)


Format: Each image is a 28×28 matrix with pixel values ranging from 0 (black) to 255 (white).

Why MNIST?

Simple yet challenging enough for ML research.


Well-structured and preprocessed, eliminating the need for extensive data cleaning.

Accessing MNIST

Available in popular ML libraries:

TensorFlow/Keras: tf.keras.datasets.mnist
PyTorch: torchvision.datasets.MNIST
Scikit-learn: fetch_openml('mnist_784')

from sklearn.datasets import fetch_openml

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors

# Load MNIST dataset
mnist = fetch_openml('mnist_784', version=1, parser='auto')
X, y = mnist['data'], mnist['target']

np.random.seed(42)

The MNIST digits in this set are flattened arrays of 784 pixels. We can reshape them to 28x28 pixels and plot them using matplotlib.

image = X.iloc[0, :].values.reshape((28, 28))
plt.imshow(image)

<matplotlib.image.AxesImage at 0x7e3a7b6a3610>

[Output figure: the first MNIST digit rendered as a 28x28 image]

1. Preparing the Data


a. Using the full dataset, normalize each column so that the minimum column value is 0, and the maximum is 1. (Hint: if your normalization
process leads to missing values, replace these with 0!)

b. Get a random sample of 10% of the data. The full dataset may take a while to run some of the below methods.

# a.
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
# Per the hint: replace any missing values produced by the normalization with 0
# (MinMaxScaler usually handles constant columns itself, so this is just a safeguard).
X_normalized = np.nan_to_num(X_normalized)

# b.
# Draw one set of indices and use it for both X and y so the sample and its labels stay aligned.
sample_size = int(0.1 * X_normalized.shape[0])
sample_indices = np.random.choice(X_normalized.shape[0], sample_size, replace=False)
X_sample = X_normalized[sample_indices, :]
y_sample = y.iloc[sample_indices]

2. K-Means Clustering


a. Using the MNIST subset, determine the optimal k value for k-means according to the silhouette score. Use a range of k-values from 2 - 12.

b. Fit a k-means model with the optimal k value.

c. Using a dimensionality reduction method, t-SNE, generate a two-dimensional representation (called an embedding) of the MNIST subset. (See an example here. Use default values, but set random_state = 42.)

d. Create two side-by-side scatterplots of the t-SNE representation using plt.subplots. Color the first according to the true labels and the second according to the k-means cluster labels. Be sure to include proper figure titles and a legend.

e. Describe the fit. Does this align with your expectation, given the silhouette score?

f. Fit k-means with 10 clusters and calculate the silhouette score. How does the silhouette score compare with that of the optimal k? (If 10 is optimal, just say so.)

# a.
sil_scores = []
for k in range(2, 13):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_sample)
    sil_score = silhouette_score(X_sample, kmeans.labels_)
    sil_scores.append(sil_score)

optimal_k = np.argmax(sil_scores) + 2
print(f"Optimal k based on silhouette score: {optimal_k}")

Optimal k based on silhouette score: 2


# b.
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
kmeans.fit(X_sample)
kmeans_labels = kmeans.labels_

# c.
tsne = TSNE(random_state=42)
X_tsne = tsne.fit_transform(X_sample)

# d.
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

scatter_true = axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample.astype(int), cmap='tab10')
axes[0].set_title('True Labels')
# Build the legend from the scatter's color mapping; a bare legend() call finds no labeled artists.
axes[0].legend(*scatter_true.legend_elements(), title='Labels', loc='upper right')

axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=kmeans_labels, cmap='tab10')
axes[1].set_title('K-Means Clusters')

plt.show()


Part e comments here: The K-Means clustering fit shows some alignment with the true labels, with distinct clusters for digits like 0 and 1, but
there is overlap for visually similar digits like 4 and 9. The silhouette score shows moderate cluster separation, but improvement is needed for
better distinction.

# f.
kmeans_10 = KMeans(n_clusters=10, random_state=42)
kmeans_10.fit(X_sample)
kmeans_10_labels = kmeans_10.labels_

sil_score_10 = silhouette_score(X_sample, kmeans_10_labels)


print(f"Silhouette score for 10 clusters: {sil_score_10}")

Silhouette score for 10 clusters: 0.05929741767153872

f. comments here:

3. Hierarchical Clustering


Here you will be performing hierarchical clustering on the same data subset.

Here you will be comparing linkage methods and metrics. Use the following linkage/metric combinations:
Linkage Method    Metrics
Ward              Euclidean
Single            Euclidean, Cosine, Manhattan
Complete          Euclidean, Cosine, Manhattan
Average           Euclidean, Cosine, Manhattan

a. Fit hierarchical clustering with 10 clusters for each combination and store the silhouette scores.

b. According to the silhouette scores, which combination is optimal? Fit an HC model with this combination and store the cluster labels
(make these a different variable name from that used for your k-means labels; we will eventually be comparing all clustering models).

c. As in part 2. d., plot side-by-side scatterplots using the same t-SNE embedding.

d. Describe the fit. Does this align with your expectation, given the silhouette score?

# a.
linkage_methods = ['ward', 'single', 'complete', 'average']
metrics = ['euclidean', 'cosine', 'manhattan']

sil_scores_hc = {}
for linkage in linkage_methods:
    for metric in metrics:
        # Ward linkage only supports the Euclidean metric, so skip the other combinations.
        if linkage == 'ward' and metric != 'euclidean':
            continue
        # Recent scikit-learn versions use the `metric` keyword (the old `affinity` keyword was removed).
        hc = AgglomerativeClustering(n_clusters=10, linkage=linkage, metric=metric)
        hc_labels = hc.fit_predict(X_sample)
        sil_score_hc = silhouette_score(X_sample, hc_labels)
        sil_scores_hc[(linkage, metric)] = sil_score_hc
        print(f"Linkage: {linkage}, Metric: {metric}, Silhouette Score: {sil_score_hc}")

best_combination = max(sil_scores_hc, key=sil_scores_hc.get)
print(f"Best linkage and metric combination: {best_combination} with silhouette score: {sil_scores_hc[best_combination]}")

Linkage: ward, Metric: euclidean, Silhouette Score: 0.04413657626747381



# b.
best_linkage, best_metric = best_combination
hc_best = AgglomerativeClustering(n_clusters=10, linkage=best_linkage, metric=best_metric)
hc_labels = hc_best.fit_predict(X_sample)


# c.
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

scatter_true = axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample.astype(int), cmap='tab10')
axes[0].set_title('True Labels')
# Build the legend from the scatter's color mapping; a bare legend() call finds no labeled artists.
axes[0].legend(*scatter_true.legend_elements(), title='Labels', loc='upper right')

axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=hc_labels, cmap='tab10')
axes[1].set_title('Hierarchical Clusters')

plt.show()


Part d comments here:

4. DBSCAN Clustering


DBSCAN can be a little tricky to tune due to its two hyperparameters, eps and min_samples. The eps parameter is the maximum distance between two samples for one to be considered as in the neighborhood of the other. The min_samples parameter is the number of samples in a neighborhood for a point to be considered a core point. Here, we will attempt to find optimal hyperparameter combinations, focusing primarily on eps.
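As a quick feel for how these two parameters behave, the short sketch below runs DBSCAN on a synthetic blob dataset (the make_blobs settings and the eps/min_samples values are arbitrary choices for illustration, not tuned values for MNIST).

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Toy data: three reasonably separated 2-D blobs.
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# With a suitable eps/min_samples pair, DBSCAN should recover the three blobs;
# any points labeled -1 are noise (neither core nor border points).
toy_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_toy)
print("Clusters found:", len(set(toy_labels)) - (1 if -1 in toy_labels else 0))
print("Noise points:", int((toy_labels == -1).sum()))

Shrinking eps or raising min_samples makes the density requirement stricter, so more points end up labeled -1; that is exactly the trade-off tuned below on the MNIST subset.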

a. Setting min_samples. A rule of thumb is to set min_samples to the number of dimensions (columns) in the dataset plus one. Here, you will try two different values:

1. Follow the rule of thumb.

2. Set min_samples to the number of observations in the sample divided by 10 (roughly equal numbers per digit label).

For this subset these work out to 784 + 1 = 785 and 7000 / 10 = 700, respectively (matching the values printed below).

For each min_samples value k (both cases):

i. Calculate the distance between every data point and its kth closest neighbor.
ii. Compute the average kth nearest neighbor distances (for both min_samples values).
iii. Compute the standard deviation of the kth nearest neighbor distances (for both min_samples values).
iv. Generate a set of 10 eps values equally spaced between mean ± standard deviation (for each min_samples value).

b. Using the sets of min_samples and eps values, apply DBSCAN and record all silhouette scores.

c. Fit DBSCAN using the best hyperparameters based on the silhouette scores.

d. Plot the t-SNE embedded values using subplots:

One colored by true labels.


Another colored by cluster labels.

e. Describe the clustering results and comment on your findings.

# a.
min_samples1 = X_sample.shape[1] + 1    # rule of thumb: number of columns + 1
min_samples2 = X_sample.shape[0] // 10  # roughly one tenth of the sampled observations

def calc_kth_distances(X, k):
    # Distance from each point to its kth closest neighbor
    # (kneighbors counts the point itself as its first neighbor).
    nbrs = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nbrs.kneighbors(X)
    kth_distances = distances[:, -1]
    return kth_distances

kth_distances1 = calc_kth_distances(X_sample, min_samples1)
kth_distances2 = calc_kth_distances(X_sample, min_samples2)

mean_dist1, std_dist1 = np.mean(kth_distances1), np.std(kth_distances1)
mean_dist2, std_dist2 = np.mean(kth_distances2), np.std(kth_distances2)

print(f"Mean distance for min_samples = {min_samples1}: {mean_dist1}, Std: {std_dist1}")
print(f"Mean distance for min_samples = {min_samples2}: {mean_dist2}, Std: {std_dist2}")

Mean distance for min_samples = 785: 8.790720300862194, Std: 0.9875858071274719
Mean distance for min_samples = 700: 8.699563647113994, Std: 1.0056822617735457

# b.
eps_values1 = np.linspace(mean_dist1 - std_dist1, mean_dist1 + std_dist1, 10)
eps_values2 = np.linspace(mean_dist2 - std_dist2, mean_dist2 + std_dist2, 10)

print(f"Generated eps values for min_samples = {min_samples1}: {eps_values1}")


print(f"Generated eps values for min_samples = {min_samples2}: {eps_values2}")

Generated eps values for min_samples = 785: [7.80313449 8.02259801 8.24206152 8.46152503 8.68098854 8.90045206
9.11991557 9.33937908 9.5588426 9.77830611]
Generated eps values for min_samples = 700: [7.69388139 7.91736633 8.14085128 8.36433623 8.58782117 8.81130612
9.03479107 9.25827601 9.48176096 9.70524591]

# c.
sil_scores_dbscan = []
for min_samples_val, eps_vals in [(min_samples1, eps_values1), (min_samples2, eps_values2)]:
    for eps in eps_vals:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples_val)
        dbscan_labels = dbscan.fit_predict(X_sample)
        # The silhouette score is only defined when more than one distinct label is present.
        if len(set(dbscan_labels)) > 1:
            sil_score_dbscan = silhouette_score(X_sample, dbscan_labels)
            sil_scores_dbscan.append((min_samples_val, eps, sil_score_dbscan))

best_dbscan_combination = max(sil_scores_dbscan, key=lambda x: x[2])
print(f"Best DBSCAN combination: min_samples={best_dbscan_combination[0]}, eps={best_dbscan_combination[1]} with silhouette score: {best_dbscan_combination[2]}")

Best DBSCAN combination: min_samples=785, eps=9.778306107989666 with silhouette score: 0.2484688014770847

# d.
dbscan_best = DBSCAN(eps=best_dbscan_combination[1], min_samples=best_dbscan_combination[0])
dbscan_labels = dbscan_best.fit_predict(X_sample)

fig, axes = plt.subplots(1, 2, figsize=(12, 6))

axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample.astype(int), cmap='tab10')


axes[0].set_title('True Labels')

axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=dbscan_labels, cmap='tab10')


axes[1].set_title('DBSCAN Clusters')

plt.show()


[Output figure: t-SNE scatterplots colored by true labels and by DBSCAN cluster labels]

Part e comments here: The DBSCAN clustering results show that the algorithm struggles to define distinct clusters, probably due to noise in the data and the difficulty of choosing optimal parameters for eps and min_samples. This can lead to a poor fit compared to the other clustering methods.

5. Adjusted Rand Index (ARI) in Clustering


The Adjusted Rand Index (ARI) is a metric used to evaluate the similarity between two clusterings, accounting for chance. It measures how
well a clustering algorithm’s results match a known ground truth or reference clustering.

Intuition

Suppose you have a set of data points, and you classify them into groups (clusters).
ARI compares your clustering to a "correct" classification and checks how often pairs of points are correctly grouped together or
correctly separated in both cases.
A simple Rand Index gives a raw similarity score, but ARI adjusts for randomness, ensuring that random assignments don’t get an artificially high score (see the formula below).
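Written out (a standard statement of the chance adjustment, included here for reference): if RI denotes the raw Rand Index, then

$$\text{ARI} = \frac{\text{RI} - \mathbb{E}[\text{RI}]}{\max(\text{RI}) - \mathbb{E}[\text{RI}]}$$

so a random labeling scores about 0 and perfect agreement scores 1.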

How It's Used

Benchmarking Clustering Algorithms: ARI is useful for comparing different clustering methods to a gold-standard classification.
No Bias Toward Number of Clusters: Unlike some other metrics, ARI corrects for the number of clusters, making it more reliable when
clusters differ in size or number.

Interpreting Results

Range: ARI ranges from -1 to 1.

1 = Perfect agreement with ground truth


0 = Random clustering
Negative values = Worse than random (unlikely in practice)

Application: Used in cases like image segmentation, document clustering, or biological data grouping, where a reference classification
exists.

Python Implementation
Import via sklearn.metrics (the function adjusted_rand_score); a short usage sketch follows below.
Inputs: true labels and predicted labels (e.g., cluster labels)
Output: the Adjusted Rand Index value
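A short usage sketch (the toy label lists below are made up just to show the call, and to illustrate that ARI ignores how the cluster IDs are numbered):

from sklearn.metrics import adjusted_rand_score

# The predicted labels are a pure relabeling of the true grouping,
# so the ARI is 1.0 even though the integer IDs differ.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]
print(adjusted_rand_score(true_labels, pred_labels))  # 1.0

# A grouping unrelated to the true labels scores at or below 0
# (this particular split comes out negative, i.e., worse than chance).
print(adjusted_rand_score(true_labels, [0, 1, 0, 1, 0, 1]))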

Questions:

a. For each clustering method (k-means, hierarchical clustering, DBSCAN), calculate the adjusted rand index using the models with optimal
hyperparameters.

b. Plot all four scatterplots using subplots (1 row, 4 columns). The first colored by the true values, and the rest colored by your cluster labels
(k-means, HC, DBSCAN).

c. Do the scores seem reflective of the assigned cluster labels, according to the plots? Please explain.

# a.
ari_kmeans = adjusted_rand_score(y_sample, kmeans_labels)
ari_hc = adjusted_rand_score(y_sample, hc_labels)
ari_dbscan = adjusted_rand_score(y_sample, dbscan_labels)

print(f"ARI for K-Means: {ari_kmeans}")


print(f"ARI for Hierarchical Clustering: {ari_hc}")
print(f"ARI for DBSCAN: {ari_dbscan}")

ARI for K-Means: -0.00015513136315309682
ARI for Hierarchical Clustering: -0.00025534244962585405
ARI for DBSCAN: 2.491890088874842e-07

# b.
fig, axes = plt.subplots(1, 4, figsize=(20, 6))

# True labels plot
axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample.astype(int), cmap='tab10')
axes[0].set_title('True Labels')

# K-Means labels plot
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=kmeans_labels, cmap='tab10')
axes[1].set_title('K-Means Clusters')

# Hierarchical Clustering labels plot
axes[2].scatter(X_tsne[:, 0], X_tsne[:, 1], c=hc_labels, cmap='tab10')
axes[2].set_title('Hierarchical Clusters')

# DBSCAN labels plot
axes[3].scatter(X_tsne[:, 0], X_tsne[:, 1], c=dbscan_labels, cmap='tab10')
axes[3].set_title('DBSCAN Clusters')

plt.show()

Part c comments here: All three ARI values are essentially zero (k-means and hierarchical clustering are even slightly negative), so none of the cluster labelings agree with the true digit labels better than chance on this run. Given that the t-SNE plots show at least some coherent grouping for k-means and hierarchical clustering, these uniformly near-zero scores are lower than the plots would suggest.

6. Visualizing Clustered Digits


Pick one of the cluster labels from one of the clustering methods (your choice). Select a random subset of 20 points belonging to that cluster.

a. Using subplots (5 rows, 4 columns), plot the 20 digit images.


b. Comment on the clustered points (e.g., do they look similar? Are they representing the same digit? etc.)

# a.
cluster_label = 0
indices = np.where(kmeans_labels == cluster_label)[0]
random_indices = np.random.choice(indices, 20, replace=False)

fig, axes = plt.subplots(5, 4, figsize=(10, 12))

for i, idx in enumerate(random_indices):
    ax = axes[i // 4, i % 4]
    image = X_sample[idx].reshape(28, 28)
    ax.imshow(image, cmap='gray')
    ax.axis('off')

plt.show()

Part b comments here: The selected cluster contains digits that appear visually similar but they are not all the same digit. This suggests that
the clustering might group different digit classes together. K-means does not guarantee perfect grouping for digits that share similar pixel
patterns.

