Density Based
Density Based
Density Based
Density-Based Clustering
Basic Idea:
Clusters are dense regions in the
data space, separated by regions of
lower object density
Why Density-Based Clustering?
Results of a k-medoid
algorithm for k=4
-Neighborhood
-Neighborhood Objects within a radius of
from an object.
N ( p ) : {q | d ( p, q ) }
pp
-Neighborhood of p
-Neighborhood of q
Density of p is high (MinPts = 4)
Density of q is low (MinPts = 4)
= 1unit, MinPts = 5
Example
M, P, O, and R are core objects since each is
in an Eps neighborhood containing at least 3
points
Minpts = 3
Eps=radius
of the
circles
Density-Reachability
Directly density-reachable
An object q is directly density-reachable from
object p if p is a core object and q is in ps neighborhood.
pp
MinPts = 4
University at Buffalo The State University of New York
Density-reachability
Density-Reachable (directly and
indirectly):
A point p is directly density-reachable from p2;
p2 is directly density-reachable from p1;
p1 is directly density-reachable from q;
pp2p1q form a chain.
p1
q
p2
p is (indirectly) density-reachable
from q
q is not density- reachable from p?
MinPts = 7
Density-Connectivity
Density-reachable is not symmetric
not good enough to describe clusters
Density-Connected
A pair of points p and q are density-connected
if they are commonly density-reachable from a
point o.
p
Density-connectivity is
symmetric
o
University at Buffalo The State University of New York
Formal Description of
Cluster
Given a data set D, parameter and threshold
MinPts.
A cluster C is a subset of objects satisfying
two criteria:
Connected: p,q C: p and q are densityconnected.
Maximal: p,q: if p C and q is density-reachable
from p, then q C. (avoid redundancy)
P is a core object.
University at Buffalo The State University of New York
Review of Concepts
Is an object o in a cluster
or an outlier?
Is o a core object?
Is o density-reachable by
some core object?
Directly densityreachable
DBSCAN Algorithm
Input: The data set D
Parameter: , MinPts
For each object p in D
if p is a core object and not processed then
C = retrieve all objects density-reachable from p
mark all objects in C as processed
report C as a cluster
else mark p as outlier
end if
End For
DBScan Algorithm
DBSCAN Algorithm:
Example
Parameter
= 2 cm
MinPts = 3
for each o D do
if o is not yet classified then
if o is a core-object then
collect all objects density-reachable from o
and assign them to a new cluster.
else
assign o to NOISE
University at Buffalo The State University of New York
DBSCAN Algorithm:
Example
Parameter
= 2 cm
MinPts = 3
for each o D do
if o is not yet classified then
if o is a core-object then
collect all objects density-reachable from o
and assign them to a new cluster.
else
assign o to NOISE
University at Buffalo The State University of New York
DBSCAN Algorithm:
Example
Parameter
= 2 cm
MinPts = 3
for each o D do
if o is not yet classified then
if o is a core-object then
collect all objects density-reachable from o
and assign them to a new cluster.
else
assign o to NOISE
University at Buffalo The State University of New York
MinPts = 5
C1
C1
3. Otherwise mark p as
processed and put all
the neighbors in cluster
C
P1
C1
C1
C1
C1
C1
C1
Example
Original Points
Original Points
Clusters
Resistant to Noise
Can handle clusters of different shapes and sizes
University at Buffalo The State University of New York
(MinPts=4, Eps=9.92).
Original Points
p
q
3-distance(q) :
first valley
Objects
border object
Heuristic method:
Fix a value for MinPts (default: 2 d 1)
User selects border object o from the MinPts-distance plot;
is set to MinPts-distance(o)
University at Buffalo The State University of New York
G
G1
B
D
B
A, B, C
G3
G2
3-Distance
D1
D2
B, D, E
B, D, F, G
D1, D2,
G1, G2, G3
Objects
Disadvantages
Input parameters may be difficult to determine
In some situations very sensitive to input parameter
setting
University at Buffalo The State University of New York
OPTICS cont
Produces a special order of the database wrt its
density-based clustering structure
This cluster-ordering contains info equiv to the
density-based clusterings corresponding to a broad
range of parameter settings
Good for both automatic and interactive cluster
analysis, including finding intrinsic clustering
structure
Can be represented graphically or using
visualization techniques
Density-Based Hierarchical
Clustering
Observation: Dense clusters are completely contained
by less dense clusters
C
C1
C2
Idea: Process objects in the right order and keep track of point
density in their neighborhood
C
C1
MinPts = 3
C2
2 1
reachability-distance,MinPts(p, o)
smallest distance such that p is
directly density-reachable from o
(if that distance is ?otherwise)
p
q
core-distance(o)
reachability-distance(p,o)
reachability-distance(q,o)
OPTICS: Extension of
DBSCAN
Order points by shortest reachability distance to
guarantee that clusters w.r.t. higher density are finished
first. (for a constant MinPts, higher
density requires lower )
Output:
order of points
core-distance of points
reachability-distance of points
ControlList
foreach o Database
// initially, o.processed = false for all objects o
if o.processed = false;
insert (o, ?) into ControlList;
while ControlList is not empty
database
select first element (o, r-dist) from ControlList;
retrieve N(o) and determine c_dist= core-distance(o);
set o.processed = true;
write (o, r_dist, c_dist) to file;
if o is a core object at any distance
foreach p N(o) not yet processed;
determine r_distp = reachability-distance(p, o);
if (p, _) ControlList
insert (p, r_distp) in ControlList;
else if (p, old_r_dist) ControlList and r_distp old_r_dist
update (p, r_distp) in ControlList;
University at Buffalo The State University of New York
cluster-ordered
file
OPTICS: Properties
Flat density-based clusters wrt. * and MinPts afterwards:
Starts with an object o where c-dist(o) * and r-dist(o) > *
Continues while r-dist *
2
3
17
16
18
34
16
18
Performance: approx.
Core-distance
Reachability-distance
runtime( DBSCAN(
, MinPts) )
O( n * runtime( -neighborhood-query) )
without spatial index support (worst case): O( n2 )
e.g. tree-based spatial index support: O( n log(n) )
17
reachability distance
reachability distance
cluster ordering
University at Buffalo The State University of New York
cluster ordering
MinPts = 10, = 10
MinPts = 10, = 5
MinPts = 2, = 10
3
1
An Example of OPTICS
neighboring objects stay close to each other in a linear
sequence.
Reachabilitydistance
undefined
DBSCAN VS OPTICS
DBSCAN
OPTICS
Density
Boolean value
(high/low)
Numerical value
(core distance)
Densityconnected
Boolean value
(yes/no)
Numerical value
(reachability distance)
Searching
strategy
random
greedy
Denclue: Technical
Essence
Model density by the notion of influence
Each data object exert influence on its neighborhood.
The influence decreases with distance
Example:
Consider each object is a radio, the closer you are to the
object, the louder the noise
y
Gaussian
( x) e
d ( x, y )2
2 2
Density Function
Density Definition is defined as the sum of the
influence functions of all data points.
D
Gaussian
( x) i 1 e
N
d ( x , xi ) 2
2 2
f Gaussian ( x , y ) e
f
D
Gaussian
d ( x , y )2
2 2
( x) i 1 e
D
Gaussian
d ( x , xi ) 2
2 2
( x, xi ) i 1 ( xi x) e
N
d ( x , xi ) 2
2 2
Denclue: Technical
Essence
Clusters can be determined mathematically
by identifying density attractors.
Density attractors are local maximum of the
overall density function.
Density Attractor
Cluster Definition
Center-defined cluster
A subset of objects attracted by an attractor x
density(x)
Arbitrary-shape cluster
A group of center-defined clusters which are
connected by a path P
For each object x on P, density(x) .
Center-Defined and
Arbitrary
Features of DENCLUE
Major features
Solid mathematical foundation
Compact definition for density and cluster
Flexible for both center-defined clusters and arbitraryshape clusters
But needs a large number of parameters
: parameter to calculate density
: density threshold
: parameter to calculate attractor