Tutorial 2 - Clustering

This Jupyter notebook explores three clustering algorithms: K-Means, DBSCAN, and agglomerative clustering. It loads and summarizes a driver dataset, uses the elbow method to choose a cluster count, fits K-Means with 4 clusters, and plots the result. It then applies min-max normalization, fits K-Means again, and plots the new labels for comparison. DBSCAN is applied both before and after normalization. Finally, it runs agglomerative clustering on the two feature columns and draws a dendrogram from the normalized data.

Uploaded by

Gupta Akshay


14/09/2018 Tutorial 2 - Clustering

In [13]:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
pd.set_option('display.float_format', lambda x: '%.3f' % x)
%matplotlib inline
import matplotlib.pyplot as plt

In [9]:

data = pd.read_csv("./driver_dataset.csv", sep='\t')

In [10]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 3 columns):
Driver_ID 4000 non-null int64
Distance_Feature 4000 non-null float64
Speeding_Feature 4000 non-null float64
dtypes: float64(2), int64(1)
memory usage: 93.8 KB

In [11]:

data.describe()

Out[11]:

            Driver_ID  Distance_Feature  Speeding_Feature
count        4000.000          4000.000          4000.000
mean   3423312447.500            76.042            10.721
std          1154.845            53.470            13.709
min    3423310448.000            15.520             0.000
25%    3423311447.750            45.248             4.000
50%    3423312447.500            53.330             6.000
75%    3423313447.250            65.632             9.000
max    3423314447.000           244.790           100.000

http://localhost:8888/notebooks/Documents/BITS%20Course/DM/Tut/TUT2/Piyush_TUT/Tutorial%202%20-%20Clustering.ipynb# 1/7

In [26]:

plt.scatter(data.iloc[:,1:2], data.iloc[:,2:3])
plt.xlabel(data.columns.values[1])
plt.ylabel(data.columns.values[2])
plt.show()

In [28]:

wcss = []
for i in range(1,11):
    # fit K-Means for each candidate cluster count and record its inertia (WCSS)
    kmeans = KMeans(n_clusters = i,init = 'k-means++',random_state = 0)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)
plt.plot(range(1,11),wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
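
The WCSS values collected above come from `inertia_`, the sum of squared distances from each point to the centroid of its assigned cluster. The following is a minimal sketch on synthetic points (not the driver data; `X`, `assigned_centers`, and `wcss_manual` are names introduced here for illustration) that recomputes the quantity by hand and compares it with what scikit-learn reports.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points stand in for the driver features (not the real data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(50, 2)),
])

kmeans_demo = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)

# WCSS: squared distance from each point to the centroid of its own cluster.
assigned_centers = kmeans_demo.cluster_centers_[kmeans_demo.labels_]
wcss_manual = float(np.sum((X - assigned_centers) ** 2))

print(wcss_manual, kmeans_demo.inertia_)
```

The two printed numbers should agree up to floating-point error, which is why a flattening WCSS curve is read as "extra clusters no longer tighten the groups much".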

In [52]:

kmeans = KMeans(n_clusters = 4,init = 'k-means++',random_state =0)

y_kmeans = kmeans.fit_predict(data)


In [53]:

%matplotlib inline
plt.figure(figsize=(40, 40))  # assigning plt.figsize has no effect; the size is set via plt.figure
plt.scatter(data.iloc[:,1],data.iloc[:,2], c=y_kmeans)

Out[53]:

<matplotlib.collections.PathCollection at 0x7f381ee64ba8>
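
The K-Means cell above is fitted on the whole frame, so `Driver_ID` takes part in the distance computation alongside the two features. As a hedged alternative, the sketch below fits on the feature columns only. The frame `demo` is a hypothetical stand-in with the notebook's column names and synthetic values, and `features`, `labels_demo`, and `centers` are illustrative names rather than anything from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in frame: same column names as the notebook, synthetic values.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Driver_ID": np.arange(200, dtype=np.int64) + 3423310448,
    "Distance_Feature": rng.uniform(15, 245, size=200),
    "Speeding_Feature": rng.uniform(0, 100, size=200),
})

# Cluster on the two feature columns only, leaving the identifier out.
features = demo[["Distance_Feature", "Speeding_Feature"]]
kmeans_demo = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
labels_demo = kmeans_demo.fit_predict(features)

# One row per centroid, expressed in the original feature units.
centers = pd.DataFrame(kmeans_demo.cluster_centers_, columns=features.columns)
print(centers)
```

Whether dropping the identifier changes the clusters on the real data is not shown here; the sketch only illustrates how the column selection would be written.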

In [47]:

from sklearn import preprocessing

#Performing Min_Max Normalization
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(data.iloc[:,1:])
dataN = pd.DataFrame(np_scaled)
dataN.head()

Out[47]:

       0      1
0  0.243  0.280
1  0.161  0.250
2  0.214  0.270
3  0.175  0.220
4  0.170  0.250
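
Min-max normalization rescales each column to the range [0, 1] with x' = (x - min) / (max - min). A small sketch, assuming three illustrative rows (the values are borrowed from the summary statistics, not from actual rows of the dataset), shows that the formula matches `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative rows only; these are not actual records from the driver dataset.
X = np.array([
    [15.52, 0.0],
    [53.33, 6.0],
    [244.79, 100.0],
])

# Column-wise min-max formula written out by hand.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(X)

print(manual)
```

After scaling, both features span the same range, so neither dominates a Euclidean distance purely because of its units.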

In [50]:

kmeans = KMeans(n_clusters = 4,init = 'k-means++',random_state =0)

y2_kmeans = kmeans.fit_predict(dataN)


In [59]:

%matplotlib inline
plt.scatter(data.iloc[:,1],data.iloc[:,2], c=y2_kmeans)

Out[59]:

<matplotlib.collections.PathCollection at 0x7f381c32eda0>
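
The two scatter plots can be compared by eye, but a numeric comparison of the two labelings may also help. The sketch below is one possible approach, not part of the original notebook: it uses `adjusted_rand_score` from scikit-learn, which ignores how the cluster numbers happen to be assigned, on synthetic data. The names `labels_raw`, `labels_scaled`, and `ari` are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic two-column data with ranges loosely similar to the driver features.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(15, 245, 300), rng.uniform(0, 100, 300)])
X_scaled = MinMaxScaler().fit_transform(X)

# Cluster the raw and the normalized versions with identical settings.
labels_raw = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
labels_scaled = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)

# 1.0 means the two partitions are identical; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(labels_raw, labels_scaled)
print(ari)
```

A low score would suggest that normalization changed the grouping substantially, while a score near 1 would suggest it made little difference.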

In [ ]:

#DBSCAN STARTS

In [78]:

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.1, metric='euclidean', min_samples=5)

In [79]:

dbsc = dbscan.fit(data)
dbsc.labels_

Out[79]:

array([-1, -1, -1, ..., -1, -1, -1])
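
A label of -1 means DBSCAN treats the point as noise, so the output above says that no cluster was found. One plausible cause, assuming `Driver_ID` values are unique integers (the summary statistics are consistent with this but do not prove it), is that fitting on the whole frame puts every pair of points at least 1 apart in the ID column, which is far more than `eps=0.1`. The sketch below reproduces that effect on a toy array; `ids`, `feats`, and the label variables are illustrative names.

```python
import numpy as np
from sklearn.cluster import DBSCAN

n = 100
ids = np.arange(n, dtype=float)   # assumed: unique, consecutive identifiers
feats = np.zeros((n, 2))          # toy features, identical for every point

# With the ID column included, every pair of points is at least 1 apart.
X_with_id = np.column_stack([ids, feats])
labels_with_id = DBSCAN(eps=0.1, min_samples=5).fit(X_with_id).labels_

# Without the ID column, all points coincide and form a single cluster.
labels_feats = DBSCAN(eps=0.1, min_samples=5).fit(feats).labels_

print(set(labels_with_id), set(labels_feats))
```

The scale of the unnormalized features relative to `eps` can have a similar effect, so this is a sketch of one contributing factor rather than a full diagnosis.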


In [80]:

plt.scatter(data.iloc[:,1],data.iloc[:,2], c=dbsc.labels_)

Out[80]:

<matplotlib.collections.PathCollection at 0x7f38142e7550>

In [81]:

dbsc = dbscan.fit(dataN)
dbsc.labels_

Out[81]:

array([0, 0, 0, ..., 1, 1, 1])

In [82]:

plt.scatter(data.iloc[:,1],data.iloc[:,2], c=dbsc.labels_)

Out[82]:

<matplotlib.collections.PathCollection at 0x7f381437b198>
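
The truncated label array does not show how many clusters DBSCAN found or how many points it marked as noise. A short sketch of how those counts could be read from `labels_`, using a small hand-built array instead of the driver data (the names `group_a`, `group_b`, `outlier`, `n_clusters`, and `n_noise` are introduced for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of 10 points each, plus one isolated point.
offsets = np.arange(10) * 0.01
group_a = np.column_stack([offsets, np.zeros(10)])
group_b = np.column_stack([1.0 + offsets, np.ones(10)])
outlier = np.array([[5.0, 5.0]])
X = np.vstack([group_a, group_b, outlier])

labels = DBSCAN(eps=0.1, min_samples=5).fit(X).labels_

# Cluster labels are 0, 1, ...; the label -1 is reserved for noise.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(n_clusters, n_noise)
```

The same two lines at the end can be applied to `dbsc.labels_` from the notebook.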


In [66]:

model.labels_

Out[66]:

array([-1, -1, -1, ..., -1, -1, -1])

In [ ]:

#AGGLOMERATIVE STARTS

In [67]:

from sklearn.cluster import AgglomerativeClustering as AC

aggclus = AC(n_clusters = 4,affinity='euclidean',linkage='ward',compute_full_tree='
y_aggclus= aggclus.fit_predict(data.iloc[:,1:3])

In [68]:

y_aggclus

Out[68]:

array([3, 3, 3, ..., 1, 1, 1])

In [69]:

from scipy.cluster.hierarchy import dendrogram, linkage,cut_tree

from scipy.cluster.hierarchy import fcluster
k=4
linkage_matrix = linkage(dataN, "ward",metric="euclidean")
ddata=dendrogram(linkage_matrix,color_threshold=1.5)
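
The cell above imports `fcluster` and sets `k=4` but does not use either. One way they might be combined, shown here as a sketch on synthetic points and not as the author's intended next step, is to cut the Ward linkage into at most `k` flat clusters with `criterion="maxclust"`. The array `X`, the group centers, and `flat_labels` are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Four well-separated synthetic groups of 20 points each.
rng = np.random.default_rng(0)
group_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 25.0], [40.0, 40.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in group_centers])

# Ward linkage on Euclidean distances, as in the notebook.
Z = linkage(X, method="ward", metric="euclidean")

# Cut the tree so that no more than k flat clusters remain; labels start at 1.
k = 4
flat_labels = fcluster(Z, t=k, criterion="maxclust")

print(np.unique(flat_labels))
```

On the notebook's data the call would be `fcluster(linkage_matrix, t=k, criterion="maxclust")`, giving one label per row of `dataN`.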

In [83]:

plt.figure(figsize=(5,7))  # create the figure first so the dendrogram is drawn into it
ddata=dendrogram(linkage_matrix,color_threshold=1.5)
plt.show()

