Copy of 09-clustering.ipynb
Open in Colab
After clicking the "Open in Colab" link, copy the notebook to your own Google Drive before getting started, or your work will not be saved.
One of the most straightforward tasks we can perform on an unlabeled dataset is to find groups of points that are similar to one another -- what we call clusters.
K-Means is one of the most popular "clustering" algorithms. K-Means stores k centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid.
K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
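To make these two alternating steps concrete, here is a minimal NumPy sketch of a few iterations of this procedure (Lloyd's algorithm). The toy 2-D data, k = 3, and the fixed iteration count are assumptions made purely for illustration; the rest of this notebook uses scikit-learn's KMeans instead.

import numpy as np

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 2))                             # toy 2-D data (illustration only)
k = 3
centroids = X_toy[rng.choice(len(X_toy), k, replace=False)]   # random initial centroids

for _ in range(10):                                           # a few Lloyd iterations
    # (1) assignment step: each point joins the cluster of its nearest centroid
    dists = np.linalg.norm(X_toy[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # (2) update step: each centroid moves to the mean of its assigned points
    centroids = np.array([
        X_toy[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])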
Dataset: MNIST
MNIST Dataset: A Brief Overview
Introduction
The MNIST (Modified National Institute of Standards and Technology) dataset is one of the most widely used datasets in machine learning,
particularly for benchmarking image classification algorithms.
Description
Type: Handwritten digit dataset
Size: 70,000 grayscale images (28x28 pixels)
Training set: 60,000 images
Test set: 10,000 images
Classes: 10 (Digits 0-9)
Format: Each image is a 28×28 matrix with pixel values ranging from 0 (black) to 255 (white).
Why MNIST?
Simple yet challenging enough for ML research.
Well-structured and preprocessed, eliminating the need for extensive data cleaning.
Accessing MNIST
Available in popular ML libraries:
TensorFlow/Keras: tf.keras.datasets.mnist
PyTorch: torchvision.datasets.MNIST
Scikit-learn: fetch_openml('mnist_784')
from sklearn.datasets import fetch_openml
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.metrics import adjusted_rand_score
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors
# Load MNIST dataset
mnist = fetch_openml('mnist_784', version=1, parser='auto')
X, y = mnist['data'], mnist['target']
np.random.seed(42)
The MNIST digits in this set are flattened arrays of 784 pixels. We can reshape them to 28x28 pixels and plot them using matplotlib.
image = X.iloc[0, :].values.reshape((28, 28))
plt.imshow(image)
<matplotlib.image.AxesImage at 0x7e3a7b6a3610>
1. Preparing the Data
a. Using the full dataset, normalize each column so that the minimum column value is 0, and the maximum is 1. (Hint: if your normalization
process leads to missing values, replace these with 0!)
b. Get a random sample of 10% of the data. The full dataset may take a while to run some of the below methods.
# a.
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)  # MinMaxScaler handles constant (all-zero) columns without producing NaNs
# b.
sample_size = int(0.1 * X_normalized.shape[0])
sample_idx = np.random.choice(X_normalized.shape[0], sample_size, replace=False)  # one index set so X and y stay aligned
X_sample = X_normalized[sample_idx, :]
y_sample = y.iloc[sample_idx]
2. K-Means Clustering
a. Using the MNIST subset, determine the optimal k value for k-means according to the silhouette score (a quick refresher on the silhouette score follows this list). Use a range of k-values from 2 - 12.
b. Fit a k-means model with the optimal k value.
c. Using a dimensionality reduction method, t-SNE, generate a two-dimensional representation (called an embedding) of the MNIST subset. (See an example here. Use default values, but set random_state = 42.)
d. Create two side-by-side scatterplots of the t-SNE representation using plt.subplots. Color the first figure according to the true labels and the second according to the k-means cluster labels. Be sure to include proper figure titles and a legend.
e. Describe the fit. Does this align with your expectation, given the silhouette score?
f. Fit k-means with 10 clusters and calculate the silhouette score. How does it compare with the score for the optimal k? (If 10 is optimal, just say so.)
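As a quick refresher on the metric used throughout this section: for a point $i$, let $a(i)$ be its mean distance to the other points in its own cluster and $b(i)$ its mean distance to the points in the nearest other cluster. Its silhouette is

$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))},$$

and silhouette_score reports the average of $s(i)$ over all points, ranging from -1 to 1 (higher means better-separated clusters).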
# a.
sil_scores = []
for k in range(2, 13):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_sample)
    sil_score = silhouette_score(X_sample, kmeans.labels_)
    sil_scores.append(sil_score)
optimal_k = np.argmax(sil_scores) + 2
print(f"Optimal k based on silhouette score: {optimal_k}")
Optimal k based on silhouette score: 2
# b.
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
kmeans.fit(X_sample)
kmeans_labels = kmeans.labels_
# c.
tsne = TSNE(random_state=42)
X_tsne = tsne.fit_transform(X_sample)
# d.
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
scatter_true = axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample.astype(int), cmap='tab10')
axes[0].set_title('True Labels')
# build the legend from the scatter's color mapping (a bare legend() call has no labeled artists)
axes[0].legend(*scatter_true.legend_elements(), title='Labels', loc='upper right')
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=kmeans_labels, cmap='tab10')
axes[1].set_title('K-Means Clusters')
plt.show()
Part e comments here: With the optimal k = 2, the K-Means fit only partially aligns with the true labels: some digits, such as 0 and 1, separate fairly cleanly, while visually similar digits such as 4 and 9 end up mixed together. This is consistent with the low silhouette score, which indicates only weak cluster separation, so there is clear room for improvement.
# f.
kmeans_10 = KMeans(n_clusters=10, random_state=42)
kmeans_10.fit(X_sample)
kmeans_10_labels = kmeans_10.labels_
sil_score_10 = silhouette_score(X_sample, kmeans_10_labels)
print(f"Silhouette score for 10 clusters: {sil_score_10}")
Silhouette score for 10 clusters: 0.05929741767153872
f. comments here: The silhouette score with 10 clusters (about 0.059) does not improve on the score at the optimal k = 2, so forcing one cluster per digit actually reduces the measured separation; in raw pixel space the ten digit classes do not form ten compact, well-separated groups.
3. Hierarchical Clustering
Here you will be performing hierarchical clustering on the same data subset.
Here you will be comparing linkage methods and metrics. Use the following linkage/metric combinations:
Linkage Method | Metrics
Ward | Euclidean
Single | Euclidean, Cosine, Manhattan
Complete | Euclidean, Cosine, Manhattan
Average | Euclidean, Cosine, Manhattan
a. Fit hierarchical clustering with 10 clusters for each combination and store the silhouette scores.
b. According to the silhouette scores, which combination is optimal? Fit an HC model with this combination and store the cluster labels
(make these a different variable name from that used for your k-means labels; we will eventually be comparing all clustering models).
c. As in part 2. d., plot side-by-side scatterplots using the same t-SNE embedding.
d. Describe the fit. Does this align with your expectation, given the silhouette score?
# a.
linkage_methods = ['ward', 'single', 'complete', 'average']
metrics = ['euclidean', 'manhattan', 'cosine']

sil_scores_hc = {}
for linkage in linkage_methods:
    for metric in metrics:
        # Ward linkage only supports Euclidean distances
        if linkage == 'ward' and metric != 'euclidean':
            continue
        # recent scikit-learn versions call this argument `metric` (formerly `affinity`)
        hc = AgglomerativeClustering(n_clusters=10, linkage=linkage, metric=metric)
        hc_labels = hc.fit_predict(X_sample)
        sil_score_hc = silhouette_score(X_sample, hc_labels)
        sil_scores_hc[(linkage, metric)] = sil_score_hc
        print(f"Linkage: {linkage}, Metric: {metric}, Silhouette Score: {sil_score_hc}")

best_combination = max(sil_scores_hc, key=sil_scores_hc.get)
print(f"Best linkage and metric combination: {best_combination} with silhouette score: {sil_scores_hc[best_combination]}")
Linkage: ward, Metric: euclidean, Silhouette Score: 0.04413657626747381
# b.
best_linkage, best_metric = best_combination
hc_best = AgglomerativeClustering(n_clusters=10, linkage=best_linkage, metric=best_metric)
hc_labels = hc_best.fit_predict(X_sample)
# c.
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
scatter_true = axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample.astype(int), cmap='tab10')
axes[0].set_title('True Labels')
axes[0].legend(*scatter_true.legend_elements(), title='Labels', loc='upper right')
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=hc_labels, cmap='tab10')
axes[1].set_title('Hierarchical Clusters')
plt.show()
Part d comments here:
4. DBSCAN Clustering
DBSCAN can be a little tricky to tune due to its two hyperparameters, eps and min_samples. The eps parameter is the maximum distance between two samples for one to be considered as in the neighborhood of the other. The min_samples parameter is the number of samples in a neighborhood for a point to be considered a core point. Here, we will attempt to find optimal hyperparameter combinations, focusing primarily on eps.
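To make the two parameters concrete, the small sketch below checks whether a single point would qualify as a core point for a given eps and min_samples. The toy data and the particular eps/min_samples values are assumptions for illustration only, not the values used in the assignment.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))   # toy 2-D data (illustration only)
eps, min_samples = 0.5, 5           # assumed example values

# Find all points within eps of point 0. DBSCAN counts the point itself, so point 0
# is a core point if this neighborhood contains at least min_samples members.
nbrs = NearestNeighbors(radius=eps).fit(X_toy)
neighborhood = nbrs.radius_neighbors(X_toy[:1], return_distance=False)[0]
print(f"Point 0 has {len(neighborhood)} neighbors within eps={eps}; "
      f"core point: {len(neighborhood) >= min_samples}")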
a. Setting min_samples. A rule of thumb is to set min_samples to the number of dimensions (columns) in the dataset plus one. Here, you will try two different values:
1. Follow the rule of thumb.
2. Set min_samples to the number of observations in the sample divided by 10 (roughly equal numbers per digit label).
For each min_samples value k (both cases):
i. Calculate the distance between every data point and its kth closest neighbor.
ii. Compute the average kth nearest neighbor distances (for both min_samples values).
iii. Compute the standard deviation of the kth nearest neighbor distances (for both min_samples values).
iv. Generate a set of 10 eps values equally spaced between mean ± standard deviation (for each min_samples value).
b. Using the sets of min_samples and eps values, apply DBSCAN and record all silhouette scores.
c. Fit DBSCAN using the best hyperparameters based on the silhouette scores.
d. Plot the t-SNE embedded values using subplots:
One colored by true labels.
Another colored by cluster labels.
e. Describe the clustering results and comment on your findings.
# a.
min_samples1 = X_sample.shape[1] + 1
min_samples2 = X_sample.shape[0] // 10
def calc_kth_distances(X, k):
    # distance from each point to its kth nearest neighbor
    # (kneighbors counts the point itself as a neighbor, so this is the distance to its (k-1)th other point)
    nbrs = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nbrs.kneighbors(X)
    kth_distances = distances[:, -1]
    return kth_distances
kth_distances1 = calc_kth_distances(X_sample, min_samples1)
kth_distances2 = calc_kth_distances(X_sample, min_samples2)
mean_dist1, std_dist1 = np.mean(kth_distances1), np.std(kth_distances1)
mean_dist2, std_dist2 = np.mean(kth_distances2), np.std(kth_distances2)
print(f"Mean distance for min_samples = {min_samples1}: {mean_dist1}, Std: {std_dist1}")
print(f"Mean distance for min_samples = {min_samples2}: {mean_dist2}, Std: {std_dist2}")
Mean distance for min_samples = 785: 8.790720300862194, Std: 0.9875858071274719
Mean distance for min_samples = 700: 8.699563647113994, Std: 1.0056822617735457
# a. (iv): generate candidate eps values between mean ± standard deviation
eps_values1 = np.linspace(mean_dist1 - std_dist1, mean_dist1 + std_dist1, 10)
eps_values2 = np.linspace(mean_dist2 - std_dist2, mean_dist2 + std_dist2, 10)
print(f"Generated eps values for min_samples = {min_samples1}: {eps_values1}")
print(f"Generated eps values for min_samples = {min_samples2}: {eps_values2}")
Generated eps values for min_samples = 785: [7.80313449 8.02259801 8.24206152 8.46152503 8.68098854 8.90045206
9.11991557 9.33937908 9.5588426 9.77830611]
Generated eps values for min_samples = 700: [7.69388139 7.91736633 8.14085128 8.36433623 8.58782117 8.81130612
9.03479107 9.25827601 9.48176096 9.70524591]
# b.
sil_scores_dbscan = []
for min_samples_val, eps_vals in [(min_samples1, eps_values1), (min_samples2, eps_values2)]:
    for eps in eps_vals:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples_val)
        dbscan_labels = dbscan.fit_predict(X_sample)
        # the silhouette score is only defined when there are at least two distinct labels
        if len(set(dbscan_labels)) > 1:
            sil_score_dbscan = silhouette_score(X_sample, dbscan_labels)
            sil_scores_dbscan.append((min_samples_val, eps, sil_score_dbscan))
best_dbscan_combination = max(sil_scores_dbscan, key=lambda x: x[2])
print(f"Best DBSCAN combination: min_samples={best_dbscan_combination[0]}, eps={best_dbscan_combination[1]} with silhouette score: {best_dbscan_combination[2]}")
Best DBSCAN combination: min_samples=785, eps=9.778306107989666 with silhouette score: 0.2484688014770847
# c. and d.
dbscan_best = DBSCAN(eps=best_dbscan_combination[1], min_samples=best_dbscan_combination[0])
dbscan_labels = dbscan_best.fit_predict(X_sample)
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample.astype(int), cmap='tab10')
axes[0].set_title('True Labels')
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=dbscan_labels, cmap='tab10')
axes[1].set_title('DBSCAN Clusters')
plt.show()
Part e comments here: The DBSCAN clustering results show that the algorithm struggles to define distinct clusters, probably due to noise in the data and the difficulty of choosing optimal values for eps and min_samples. This can lead to a poorer fit compared to the other clustering methods.
5. Adjusted Rand Index (ARI) in Clustering
The Adjusted Rand Index (ARI) is a metric used to evaluate the similarity between two clusterings, accounting for chance. It measures how
well a clustering algorithm’s results match a known ground truth or reference clustering.
Intuition
Suppose you have a set of data points, and you classify them into groups (clusters).
ARI compares your clustering to a "correct" classification and checks how often pairs of points are correctly grouped together or
correctly separated in both cases.
A simple Rand Index gives a raw similarity score, but ARI adjusts for randomness (see the formula just below), ensuring that random assignments don't get an artificially high score.
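Concretely, the adjustment rescales the raw Rand Index (RI) by its expected value under random label assignments:

$$\text{ARI} = \frac{\text{RI} - \mathbb{E}[\text{RI}]}{\max(\text{RI}) - \mathbb{E}[\text{RI}]},$$

so a random clustering scores about 0 while perfect agreement scores 1.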
How It's Used
Benchmarking Clustering Algorithms: ARI is useful for comparing different clustering methods to a gold-standard classification.
No Bias Toward Number of Clusters: Unlike some other metrics, ARI corrects for the number of clusters, making it more reliable when
clusters differ in size or number.
Interpreting Results
Range: ARI ranges from -1 to 1.
1 = Perfect agreement with ground truth
0 = Random clustering
Negative values = Worse than random (unlikely in practice)
Application: Used in cases like image segmentation, document clustering, or biological data grouping, where a reference classification
exists.
Python Implementation
Import via sklearn.metrics (the function adjusted_rand_score )
Inputs: True Labels, Predicted Labels (E.g., cluster labels)
Output: The Adjusted Rand Index Value
Questions:
a. For each clustering method (k-means, hierarchical clustering, DBSCAN), calculate the adjusted rand index using the models with optimal
hyperparameters.
b. Plot all four scatterplots using subplots (1 row, 4 columns). The first colored by the true values, and the rest colored by your cluster labels
(k-means, HC, DBSCAN).
c. Do the scores seem reflective of the assigned cluster labels, according to the plots? Please explain.
# a.
ari_kmeans = adjusted_rand_score(y_sample, kmeans_labels)
ari_hc = adjusted_rand_score(y_sample, hc_labels)
ari_dbscan = adjusted_rand_score(y_sample, dbscan_labels)
print(f"ARI for K-Means: {ari_kmeans}")
print(f"ARI for Hierarchical Clustering: {ari_hc}")
print(f"ARI for DBSCAN: {ari_dbscan}")
ARI for K-Means: -0.00015513136315309682
ARI for Hierarchical Clustering: -0.00025534244962585405
ARI for DBSCAN: 2.491890088874842e-07
# b.
fig, axes = plt.subplots(1, 4, figsize=(20, 6))
#True labels plot
axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample.astype(int), cmap='tab10')
axes[0].set_title('True Labels')
#K-Means labels plot
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=kmeans_labels, cmap='tab10')
axes[1].set_title('K-Means Clusters')
#Hierarchical Clustering labels plot
axes[2].scatter(X_tsne[:, 0], X_tsne[:, 1], c=hc_labels, cmap='tab10')
axes[2].set_title('Hierarchical Clusters')
#DBSCAN labels plot
axes[3].scatter(X_tsne[:, 0], X_tsne[:, 1], c=dbscan_labels, cmap='tab10')
axes[3].set_title('DBSCAN Clusters')
plt.show()
Part c comments here: All three ARI values are essentially zero (K-Means and hierarchical clustering are slightly negative, DBSCAN barely positive), so by this metric none of the cluster labelings agrees with the true digit labels beyond what random assignment would achieve. Whether that seems reflective depends on the plots: near-zero scores are only consistent with cluster colorings that show little correspondence to the true-label groupings.
6. Visualizing Clustered Digits
Pick one of the cluster labels from one of the clustering methods (your choice). Select a random subset of 20 points belonging to that cluster.
a. Using subplots (5 rows, 4 columns), plot the 20 digit images.
b. Comment on the clustered points (e.g., do they look similar? Are they representing the same digit?).
# a.
cluster_label = 0
indices = np.where(kmeans_labels == cluster_label)[0]
random_indices = np.random.choice(indices, 20, replace=False)
fig, axes = plt.subplots(5, 4, figsize=(10, 12))
for i, idx in enumerate(random_indices):
    ax = axes[i // 4, i % 4]
    image = X_sample[idx].reshape(28, 28)
    ax.imshow(image, cmap='gray')
    ax.axis('off')
plt.show()
Part b comments here: The selected cluster contains digits that appear visually similar but they are not all the same digit. This suggests that
the clustering might group different digit classes together. K-means does not guarantee perfect grouping for digits that share similar pixel
patterns.