Unsupervised learning
About this module
The goal of unsupervised learning is to model patterns that are hidden in the data. For
example, in our retail dataset there may be groups of customers with particular behaviours,
e.g. customers that use the shop for expensive items, customers that use the shop only with a
small budget, customers that use the website only in some periods of the year, and so on. With
unsupervised learning we can discover these kinds of pattern and summarise them.
The analysis that allows us to discover and consolidate patterns is called unsupervised because
we do not know what groups there are in the data or the group membership of any individual
observation. In this case, we say that the data is unlabelled. The most common unsupervised
learning method is clustering, where patterns are discovered by grouping samples.
Clustering with K-Means
K-means clustering is a method for finding clusters and cluster centres in a set of unlabelled
data. Intuitively, we might think of a cluster as comprising a group of data points whose inter-
point distances are small compared with the distances to points outside of the cluster. Given
an initial set of K centres, the K-means algorithm alternates between two steps:
1. for each centre we identify the subset of training points (its cluster) that is closer to it than to any other centre;
2. the mean of each feature over the data points in each cluster is computed, and the corresponding vector of means becomes the new centre for that cluster.
These two steps are iterated until the centres no longer move or the assignments no longer
change. Then, a new point x can be assigned to the cluster of the closest prototype.
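These two steps can be written down almost verbatim. Below is a minimal NumPy sketch of the loop; the function and variable names are illustrative, and the empty-cluster case is ignored for simplicity.
PYTHON
import numpy as np

def kmeans_sketch(X, K, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # initialise: pick K distinct data points as the initial centres
    centres = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # step 1: assign every point to its closest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: move each centre to the mean of its assigned points
        new_centres = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # stop when the centres no longer move
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels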
Run K-Means with two features
Isolate the features mean_spent and max_spent , then run the K-Means algorithm on the
resulting dataset using K=2 (in sklearn, it is n_clusters = 2 ) and visualise the result.
PYTHON
# Apply k-means with 2 clusters using a subset of the features
# (mean_spent and max_spent)
from sklearn.cluster import KMeans

Xsub = X[:, 1:3]
n_clusters = 2
kmeans = KMeans(n_clusters = n_clusters)
kmeans.fit(Xsub)  (1)
# use the fitted model to predict what the cluster of each customer should be
cluster_assignment = kmeans.predict(Xsub)  (2)
cluster_assignment
1. The method fit runs the K-Means algorithm on the data that we pass to it.
2. The method predict returns a cluster label for each sample in the data.
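Note that the attribute kmeans.labels_ stores the assignments found during fitting, so for the training data it matches the output of predict; an explicit predict call is mainly useful for new, unseen samples.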
PYTHON
# Visualise the clusters using a scatter plot or scatterplot matrix if you wish
from plotly.graph_objs import Scatter, Layout
from plotly.offline import iplot

data = [
    Scatter(
        x = Xsub[cluster_assignment == i, 0],
        y = Xsub[cluster_assignment == i, 1],
        mode = 'markers',
        name = 'cluster ' + str(i)
    ) for i in range(n_clusters)
]
layout = Layout(
    xaxis = dict(title = 'mean_spent'),  # column 0 of Xsub
    yaxis = dict(title = 'max_spent'),   # column 1 of Xsub
    height = 600,
)
fig = dict(data = data, layout = layout)
iplot(fig)
Figure 1. K-Means clustering results with two features.
The separation between the two clusters is clean: they can be separated with a straight line.
One cluster contains customers with low spending and the other customers with high spending.
Run K-Means with all the features
Run K-Means using all the available features and visualise the result in the subspace created
by mean_spent and max_spent .
PYTHON
# Apply k-means with 2 clusters using all the features
PYTHON
# Adapt the visualisation code accordingly
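A possible solution, reusing the imports and plotting code from above. Note that cluster_assignment is overwritten with the assignments of the new model, which is what the following steps use.
PYTHON
# Apply k-means with 2 clusters using all the features
kmeans = KMeans(n_clusters = n_clusters)
kmeans.fit(X)
cluster_assignment = kmeans.predict(X)

# Adapt the visualisation: plot in the mean_spent/max_spent subspace
data = [
    Scatter(
        x = X[cluster_assignment == i, 1],  # mean_spent
        y = X[cluster_assignment == i, 2],  # max_spent
        mode = 'markers',
        name = 'cluster ' + str(i)
    ) for i in range(n_clusters)
]
layout = Layout(
    xaxis = dict(title = 'mean_spent'),
    yaxis = dict(title = 'max_spent'),
    height = 600,
)
iplot(dict(data = data, layout = layout))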
This is what you should observe:
Figure 2. K-Means clustering results with all features.
The result is now different. The first cluster contains customers whose maximum spending is
close to their mean spending, while the second contains customers whose maximum spending is
far above their mean spending. This way we can tell apart customers that could be willing to
buy items that cost more than their average spending.
Question: Why can’t the clusters be separated with a line as before?
Compare expenditure between clusters
Select the feature 'mean_spent' (or any feature of your choice) and compare the two clusters
obtained. Can you interpret the output of these commands?
PYTHON
# Compare expenditure between clusters
import pandas as pd

feat = 1  # column index of mean_spent
cluster0_desc = pd.DataFrame(X[cluster_assignment == 0, feat],
                             columns=['cluster0']).describe()
cluster1_desc = pd.DataFrame(X[cluster_assignment == 1, feat],
                             columns=['cluster1']).describe()
compare_df = pd.concat((cluster0_desc, cluster1_desc), axis=1)
compare_df
Figure 3. Descriptive statistics of the clusters.
Compare expenditure with box plots
Compare the distribution of the feature mean_spent in the two clusters using a box plot.
PYTHON
# Create a box plot of the two clusters for 'mean_spent'
from plotly.graph_objs import Box

data = [
    Box(
        y = X[cluster_assignment == i, feat],
        name = 'cluster ' + str(i),
    ) for i in range(n_clusters)
]
layout = Layout(
    xaxis = dict(title = "Clusters"),
    yaxis = dict(title = "Value"),
    showlegend = False
)
fig = dict(data = data, layout = layout)
iplot(fig)
Figure 4. Boxplot of mean expenditure for each cluster.
Compare the mean expenditure distributions
Use the function create_distplot from FigureFactory to show the distribution of the
mean expenditure in both clusters.
PYTHON
# Compare mean expenditure with a histogram
from plotly.tools import FigureFactory as FF

# Add histogram data
x1 = X[cluster_assignment == 0, feat]
x2 = X[cluster_assignment == 1, feat]
# Group data together
hist_data = [x1, x2]
group_labels = ['Cluster 0', 'Cluster 1']
fig = FF.create_distplot(hist_data, group_labels, bin_size=.2)
iplot(fig)
Figure 5. Mean expenditure distribution per cluster.
Here we note:
Cluster 0 contains more customers.
Customers in cluster 1 spend more on average.
There is more variability in the behaviour of the customers in cluster 1.
Looking at the centroids
Look at the centroids of the clusters kmeans.cluster_centers_ and check the values of the
centres for the features 'mean_spent' and 'max_spent'.
PYTHON
# Compare the centroids
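One possible way to inspect them, assuming the model fitted on all the features above (the centres DataFrame name is just illustrative):
PYTHON
# Compare the centroids
import pandas as pd

# one row per cluster, one column per feature
centres = pd.DataFrame(kmeans.cluster_centers_,
                       index = ['cluster0', 'cluster1'])
# columns 1 and 2 correspond to mean_spent and max_spent
centres[[1, 2]]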
We can see that the centres coincide with the means of each cluster in the table above: by construction, each centroid is the vector of feature means of its cluster.
Compute the silhouette score
Compute the silhouette score of the clusters resulting from the application of K-Means. The
Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean
nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a)
/ max(a, b) . It represents how similar a sample is to the samples in its own cluster compared
to samples in other clusters. The best value is 1 and the worst value is -1. Values near 0
indicate overlapping clusters. Negative values generally indicate that a sample has been
assigned to the wrong cluster, as a different cluster is more similar.
PYTHON
# Compute the silhouette score
from sklearn.metrics import silhouette_score

print('silhouette_score', silhouette_score(X, cluster_assignment))
> silhouette_score 0.451526633737
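Since K-Means requires the number of clusters in advance, the silhouette score can also serve as a rough guide for choosing K: compute it for several candidate values and prefer the higher scores. A sketch (the range of values is arbitrary):
PYTHON
# Compare silhouette scores for several values of K
for k in range(2, 7):
    labels = KMeans(n_clusters = k).fit_predict(X)
    print(k, silhouette_score(X, labels))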
K-Means, pros and cons
Pros:
fast: if your dataset is big, K-Means might be the only viable option
easy to understand
any unseen point can be assigned to the cluster whose mean is closest to it
many implementations available
Cons:
you need to guess the number of clusters
clusters can only be globular (convex)
the results depend on the initial choice of the means (see the note below)
every point is assigned to a cluster, so the clusters are affected by noise
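In practice, the sensitivity to the initial means is usually mitigated by restarting the algorithm several times from random initialisations and keeping the best run. sklearn's KMeans does this automatically through the n_init parameter; a minimal sketch:
PYTHON
# Run K-Means with 10 random restarts; the run with the lowest
# within-cluster sum of squares (inertia) is kept.
# random_state fixes the seed for reproducibility.
kmeans = KMeans(n_clusters = 2, n_init = 10, random_state = 0)
kmeans.fit(X)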
Comparison of algorithms
The chart below shows the characteristics of different clustering algorithms implemented in
sklearn on simple 2D datasets.
Figure 6. Comparison of different clustering algorithms.
Here we note that K-Means works pretty well in the case of globular clusters, but it doesn't
produce good results on clusters with ring or half-moon shapes. Instead, linkage-based
(agglomerative) clustering and DBSCAN are able to deal with these kinds of cluster shapes.
The snippet to generate the chart can be found at http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html.
Wrap up of Module 3
Clustering is an unsupervised way to generate groups out of your data
Each clustering algorithm has its benefits and pitfalls
Some clustering algorithms, like DBSCAN, have an embedded outlier detection mechanism
The silhouette score can be used to measure how compact and well separated the clusters are
Last updated 2016-11-25 07:29:37 GMT