Unsupervised Learning
Dr. Dinh Dong Luong
Contents
Introduction
Distance and Similarity Measurement
K-Means Clustering
Hierarchical Clustering
Density-based Clustering
Supervised learning vs. unsupervised learning
Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute (labelled data).
These patterns are then used to predict the values of the target attribute in future data instances.
Unsupervised learning: the data have no target attribute (unlabelled data).
We want to explore the data to find some intrinsic structure in them.
Introduction
Data clustering concerns how to group a set of objects based on the similarity of their attributes and/or their proximity in the vector space.
A good clustering method produces high-quality clusters with
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
Clustering is often called an unsupervised learning task.
Clustering is often considered synonymous with unsupervised learning. In fact, association rule mining is also unsupervised.
An illustration
The data set has three natural groups of data points, i.e., 3 natural clusters.
Stages in clustering
Aspects of clustering
A clustering algorithm
Partitional clustering
Hierarchical clustering
…
A distance (similarity, or dissimilarity) function
Clustering quality
Inter-cluster distance maximized
Intra-cluster distance minimized
The quality of a clustering result depends on the algorithm, the distance function, and the application.
Similarity and Distance Measurement
Key to clustering: "similarity" and "dissimilarity" are also commonly used terms.
There are numerous distance functions for different types of data:
Numeric data
Nominal data
Distance functions for numeric attributes
Pearson Correlation
Trend Similarity (Pearson Correlation)
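The original slides show the Pearson correlation formula and a trend-similarity example as figures. As a minimal sketch (the function name `pearson_similarity` and the toy vectors below are illustrative, not from the slides), Pearson correlation divides the covariance of two numeric vectors by the product of their standard deviations; values near +1 mean the two series follow the same trend, even if their scales differ.

```python
import numpy as np

def pearson_similarity(x, y):
    """Pearson correlation of two equal-length numeric vectors.

    Returns a value in [-1, 1]; values near +1 mean the two
    series rise and fall together (similar trend).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Two series with the same trend but different scales
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(pearson_similarity(a, b))  # 1.0 (identical trend)
```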
Similarity Measurements
(Worked examples of similarity measurements are shown as figures in the original slides.)
Euclidean distance and Manhattan distance
Euclidean distance:
$dist(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2}$
Manhattan distance:
$dist(\mathbf{x}_i, \mathbf{x}_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ir} - x_{jr}|$
Weighted Euclidean distance:
$dist(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \dots + w_r (x_{ir} - x_{jr})^2}$
Squared distance and Chebychev distance
Squared Euclidean distance: used to place progressively greater weight on data points that are further apart.
$dist(\mathbf{x}_i, \mathbf{x}_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2$
Chebychev distance: used when one wants to define two data points as "different" if they differ on any one of the attributes.
$dist(\mathbf{x}_i, \mathbf{x}_j) = \max(|x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, \dots, |x_{ir} - x_{jr}|)$
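A minimal NumPy sketch of these distance functions (the function names and the example weights are illustrative, not from the slides):

```python
import numpy as np

def euclidean(xi, xj):
    return float(np.sqrt(np.sum((np.asarray(xi) - np.asarray(xj)) ** 2)))

def manhattan(xi, xj):
    return float(np.sum(np.abs(np.asarray(xi) - np.asarray(xj))))

def weighted_euclidean(xi, xj, w):
    d = np.asarray(xi) - np.asarray(xj)
    return float(np.sqrt(np.sum(np.asarray(w) * d ** 2)))

def squared_euclidean(xi, xj):
    return float(np.sum((np.asarray(xi) - np.asarray(xj)) ** 2))

def chebychev(xi, xj):
    return float(np.max(np.abs(np.asarray(xi) - np.asarray(xj))))

xi, xj = [0.1, 20.0], [0.9, 720.0]
print(euclidean(xi, xj))                       # ~700.0005
print(manhattan(xi, xj))                       # 700.8
print(weighted_euclidean(xi, xj, [0.5, 0.5]))  # down-weights both attributes
print(squared_euclidean(xi, xj))               # 490000.64
print(chebychev(xi, xj))                       # 700.0
```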
Distance function for text documents
A text document consists of a sequence of sentences, and each sentence consists of a sequence of words.
To simplify, a document is usually treated as a "bag" of words in document clustering.
Sequence and position of words are ignored.
A document is represented with a vector just like a normal data point.
It is common to use similarity to compare two documents rather than distance.
The most commonly used similarity function is the cosine similarity. We will study this later.
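As a small sketch of the bag-of-words idea (assuming simple whitespace tokenization and raw term counts; real systems typically use TF-IDF weights), cosine similarity is the dot product of the two document vectors divided by the product of their norms:

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity of two documents under a bag-of-words model."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("data clustering groups similar objects",
                        "clustering groups similar data points"))
```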
Data standardization
In the Euclidean space, standardization of attributes is recommended so that all attributes can have equal impact on the computation of distances.
Consider the following pair of data points
xi: (0.1, 20) and xj: (0.9, 720).
$dist(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(0.9 - 0.1)^2 + (720 - 20)^2} = 700.000457$
The distance is almost completely dominated by (720 - 20) = 700.
Standardize attributes: force the attributes to have a common value range.
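A minimal sketch of min-max standardization to [0, 1], one common choice (z-score standardization is another; the function name is illustrative):

```python
import numpy as np

def min_max_standardize(X):
    """Rescale each attribute (column) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ranges = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    return (X - mins) / ranges

X = np.array([[0.1, 20.0],
              [0.9, 720.0],
              [0.5, 370.0]])
Xs = min_max_standardize(X)
# After standardization both attributes lie in [0, 1],
# so neither one dominates the Euclidean distance.
print(Xs)
```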
How to choose a clustering algorithm
Choosing the "best" algorithm is a challenge.
Every algorithm has limitations and works well with certain data distributions.
It is very hard, if not impossible, to know what distribution the application data follow. The data may not fully follow any "ideal" structure or distribution required by the algorithms.
One also needs to decide how to standardize the data, how to choose a suitable distance function, and how to select other parameter values.
K-means clustering
K-means is a partitional clustering algorithm.
Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space $X \subseteq \mathbb{R}^r$, and r is the number of attributes (dimensions) in the data.
The k-means algorithm partitions the given data into k clusters.
Each cluster has a cluster center, called the centroid.
k is specified by the user.
K-means algorithm
Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
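A compact sketch of these four steps in NumPy (the function name, the iteration cap, and the random-seed handling are illustrative choices, not part of the slides; library implementations such as scikit-learn's KMeans add refinements like k-means++ seeding):

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    """Plain k-means: random seeds, assign, re-compute, repeat until stable."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    # 1) Randomly choose k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) Assign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Re-compute centroids from the current cluster memberships
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4) Stop when centroids no longer move (convergence criterion)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.default_rng(0).normal(loc=c, size=(20, 2)) for c in (0, 5, 10)])
labels, centroids = kmeans(X, k=3, rng=0)
```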
An example
An example (cont …)
Weaknesses of k-means: Problems with outliers
Weaknesses of k-means: To deal with outliers
One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points.
To be safe, we may want to monitor these possible outliers over a few iterations before deciding to remove them.
Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
Assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.
Weaknesses of k-means (cont …)
The algorithm is sensitive to initial seeds.
Weaknesses of k-means (cont …)
If we use different seeds: good results.
There are some methods to help choose good seeds.
Use frequent values to represent clusters
This method is mainly for clustering of categorical data (e.g., k-modes clustering).
It is the main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster.
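As a hedged illustration of this idea (the function name and the toy records are made up for the example), a cluster of categorical records can be represented by the most frequent value, i.e., the mode, of each attribute:

```python
from collections import Counter

def cluster_modes(records):
    """Represent a cluster of categorical records by the per-attribute mode."""
    n_attrs = len(records[0])
    return tuple(
        Counter(rec[i] for rec in records).most_common(1)[0][0]
        for i in range(n_attrs)
    )

cluster = [("red", "suv"), ("red", "sedan"), ("blue", "suv"), ("red", "suv")]
print(cluster_modes(cluster))  # ('red', 'suv') -- the cluster's representative
```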
Clusters of arbitrary shapes
Hyper-elliptical and hyper-spherical clusters are usually easy to represent, using their centroids together with spreads.
Irregularly shaped clusters are hard to represent. They may not be useful in some applications.
Using centroids is not suitable in general (upper figure).
K-means clusters may be more useful (lower figure), e.g., for making 2 sizes of T-shirts.
Hierarchical Clustering
Produce a nested sequence of clusters, a tree.
Types of hierarchical clustering
Agglomerative (bottom-up) clustering: builds the tree from the bottom level, and
merges the most similar (or nearest) pair of clusters
stops when all the data points are merged into a single cluster (i.e., the root cluster).
Divisive (top-down) clustering: starts with all data points in one cluster, the root.
Splits the root into a set of child clusters. Each child cluster is recursively divided further.
Stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.
Agglomerative clustering
It is more popular than divisive methods.
At the beginning, each data point forms a cluster (also called a node).
Merge the nodes/clusters that have the least distance between them.
Keep merging.
Eventually all nodes belong to one cluster.
Agglomerative clustering algorithm
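A minimal sketch of the merge loop under the single-link distance (the helper names are illustrative; the slides' own algorithm figure is not reproduced here, and scipy.cluster.hierarchy.linkage is the usual library route — see the linkage-methods example later in this section):

```python
import numpy as np

def single_link(c1, c2, X):
    """Single-link distance: closest pair of points across the two clusters."""
    return min(np.linalg.norm(X[i] - X[j]) for i in c1 for j in c2)

def agglomerative(X, target_k=1):
    """Start with singleton clusters and repeatedly merge the closest pair."""
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]        # each point is its own node
    while len(clusters) > target_k:
        # find the pair of clusters with the least distance
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]], X),
        )
        clusters[a] = clusters[a] + clusters[b]     # merge cluster b into a
        del clusters[b]
    return clusters

X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
print(agglomerative(X, target_k=2))  # [[0, 1, 2, 3], [4]]
```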
An example: working of the algorithm
Measuring the distance of two clusters
There are a few ways to measure the distance between two clusters.
They result in different variations of the algorithm:
Single link
Complete link
Average link
Centroids
…
Single link method
The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster.
It can find arbitrarily shaped clusters, but
it may cause the undesirable "chain effect" due to noisy points.
(Figure: two natural clusters are split into two.)
Complete link method
The distance between two clusters is the distance between the two furthest data points in the two clusters.
It is sensitive to outliers because they are far away.
Average link and centroid methods
Average link: a compromise between
the sensitivity of complete-link clustering to outliers, and
the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects.
In this method, the distance between two clusters is the average of all pairwise distances between the data points in the two clusters.
Average link and centroid methods
Centroid method: the distance between two clusters is the distance between their centroids.
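A hedged sketch of how these linkage choices are typically exercised in practice with SciPy (the toy data and the cut into two clusters are illustrative; `scipy.cluster.hierarchy.linkage` supports the 'single', 'complete', 'average', and 'centroid' methods described above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]], dtype=float)

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # hierarchical merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)
```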
Density-based Approaches
Why density-based clustering methods?
They discover clusters of arbitrary shape.
Clusters are dense regions of objects separated by regions of low density.
DBSCAN – the first density-based clustering algorithm
OPTICS – density-based cluster ordering
DENCLUE – a general density-based description of clusters and clustering
Density-Based Clustering
Basic idea: clusters are dense regions in the data space, separated by regions of lower object density.
Why Density-Based Clustering?
(Figure: results of a k-medoid algorithm for k = 4.)
Density Based Clustering: Basic Concept
Intuition for the formalization of the basic idea:
For any point in a cluster, the local point density around that point has to exceed some threshold.
The set of points from one cluster is spatially connected.
The local point density at a point p is defined by two parameters:
ε – radius of the neighborhood of point p:
$N_\varepsilon(p) := \{ q \in D \mid dist(p, q) \le \varepsilon \}$
MinPts – minimum number of points in the given neighborhood $N_\varepsilon(p)$
ε-Neighborhood
ε-Neighborhood – objects within a radius of ε from an object:
$N_\varepsilon(p) := \{ q \mid d(p, q) \le \varepsilon \}$
"High density" – the ε-neighborhood of an object contains at least MinPts objects.
(Figure: the ε-neighborhoods of p and q; for MinPts = 4, the density of p is "high" and the density of q is "low".)
Core, Border & Outlier
Given ε and MinPts, we categorize the objects into three exclusive groups:
A point is a core point if it has more than a specified number of points (MinPts) within its ε-neighborhood. These are points in the interior of a cluster.
A border point has fewer than MinPts points within its ε-neighborhood, but is in the neighborhood of a core point.
A noise point (outlier) is any point that is neither a core point nor a border point.
(Figure: core, border and outlier points for ε = 1 unit, MinPts = 5.)
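A minimal sketch that labels each point as core, border, or noise (the function name is illustrative; following the common convention, a point is core if its ε-neighborhood, including the point itself, contains at least MinPts points — the slides' "more than MinPts" wording differs slightly, so treat the exact threshold as an assumption):

```python
import numpy as np

def label_points(X, eps, min_pts):
    """Label each point in X as 'core', 'border', or 'noise'."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # eps-neighborhoods (each point lies inside its own neighborhood)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighborhoods = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighborhoods])

    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighborhoods[i]):
            labels.append("border")   # not core, but in a core point's neighborhood
        else:
            labels.append("noise")
    return labels

X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [2, 0.5], [5, 5]]
print(label_points(X, eps=1.2, min_pts=4))
# ['core', 'core', 'core', 'core', 'core', 'border', 'noise']
```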
DBSCAN Algorithm
Input: the data set D
Parameters: ε, MinPts
for each object p in D do
    if p is a core object and not yet processed then
        C = retrieve all objects density-reachable from p
        mark all objects in C as processed
        report C as a cluster
    else mark p as outlier
    end if
end for
DBSCAN: The Algorithm
Arbitrarily select a point p.
Retrieve all points density-reachable from p with respect to ε and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
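A short, hedged usage sketch (assuming scikit-learn is available; its DBSCAN estimator takes `eps` and `min_samples` and labels noise points with -1 — the toy data below are made up for the example):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point that should come out as noise
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(5.0, 0.3, size=(30, 2)),
               [[10.0, 10.0]]])

labels = DBSCAN(eps=1.0, min_samples=4).fit_predict(X)
print(set(labels))  # e.g. two cluster labels plus -1 for noise
```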
DBSCAN Algorithm: Example
Parameters: ε = 2 cm, MinPts = 3
for each o ∈ D do
    if o is not yet classified then
        if o is a core object then
            collect all objects density-reachable from o
            and assign them to a new cluster
        else
            assign o to NOISE
Example
(Figure: original points and point types — core, border and outliers — for ε = 10, MinPts = 4.)
When DBSCAN Works Well
(Figure: original points and the resulting clusters.)
• Resistant to noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
(Figure: original points and the results for MinPts = 4 with Eps = 9.92 and Eps = 9.75.)
• Cannot handle varying densities
• Sensitive to parameters
DBSCAN: Sensitive to Parameters
Density Based Clustering: Discussion
Advantages
Clusters can have arbitrary shape and size
Number of clusters is determined automatically
Can separate clusters from surrounding noise
Can be supported by spatial index structures
Disadvantages
Input parameters may be difficult to determine
In some situations very sensitive to input parameter settings
DENCLUE: Technical Essence
Influence functions describe the influence of a data point y on a point x (σ is a user-given constant):
Square: $f^y_{Square}(x) = 0$ if $d(x, y) > \sigma$, and $1$ otherwise
Gaussian: $f^y_{Gaussian}(x) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$
Density Function
The density at a point x is defined as the sum of the influence functions of all data points:
$f^D_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
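A minimal sketch of these two formulas (the function names and the example σ are illustrative; the full DENCLUE algorithm additionally performs hill-climbing to density attractors, which is not shown here):

```python
import numpy as np

def gaussian_influence(x, y, sigma):
    """Gaussian influence of data point y on point x."""
    d2 = np.sum((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2)
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

def gaussian_density(x, data, sigma):
    """DENCLUE-style density: sum of the Gaussian influences of all data points."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)

data = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0]]
print(gaussian_density([0.1, 0.1], data, sigma=0.5))  # high: inside the dense region
print(gaussian_density([3.0, 3.0], data, sigma=0.5))  # near zero: sparse region
```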