DBSCAN++: Towards fast and scalable density clustering
Jennifer Jang 1 Heinrich Jiang 2
1 Uber   2 Google Research. Correspondence to: Jennifer Jang <j.jang42@gmail.com>.

Abstract

DBSCAN is a classical density-based clustering procedure with tremendous practical relevance. However, DBSCAN implicitly needs to compute the empirical density for each sample point, leading to a quadratic worst-case time complexity, which is too slow on large datasets. We propose DBSCAN++, a simple modification of DBSCAN which only requires computing the densities for a chosen subset of points. We show empirically that, compared to traditional DBSCAN, DBSCAN++ can provide not only competitive performance but also added robustness in the bandwidth hyperparameter while taking a fraction of the runtime. We also present statistical consistency guarantees showing the trade-off between computational cost and estimation rates. Surprisingly, up to a certain point, we can enjoy the same estimation rates while lowering computational cost, showing that DBSCAN++ is a sub-quadratic algorithm that attains minimax optimal rates for level-set estimation, a quality that may be of independent interest.

1. Introduction

Density-based clustering algorithms such as Mean Shift (Cheng, 1995) and DBSCAN (Ester et al., 1996) have made a large impact on a wide range of areas in data analysis, including outlier detection, computer vision, and medical imaging. As data volumes rise, non-parametric unsupervised procedures are becoming ever more important in understanding large datasets. Thus, there is an increasing need to establish more efficient versions of these algorithms. In this paper, we focus on improving the classical DBSCAN procedure.

It was long believed that DBSCAN had a runtime of O(n log n) until it was proven to be O(n^2) in the worst case by Gan and Tao (2015). They showed that while DBSCAN can run in O(n log n) when the dimension is at most 2, it quickly starts to exhibit quadratic behavior in high dimensions and/or when n becomes large. In fact, we show in Figure 1 that even with a simple mixture of 3-dimensional Gaussians, DBSCAN already starts to show quadratic behavior.

The quadratic runtime for these density-based procedures can be seen from the fact that they implicitly must compute density estimates for each data point, which is linear time in the worst case for each query. In the case of DBSCAN, such queries are proximity-based. There has been much work done in using space-partitioning data structures such as KD-Trees (Bentley, 1975) and Cover Trees (Beygelzimer et al., 2006) to improve query times, but these structures are all still linear in the worst-case. Another line of work that has had practical success is in approximate nearest neighbor methods (e.g. Indyk and Motwani (1998); Datar et al. (2004)), which have sub-linear queries, but such methods come with few approximation guarantees.

DBSCAN proceeds by computing the empirical densities for each sample point and then designating points whose densities are above a threshold as core-points. Then, a neighborhood graph of the core-points is constructed, and the clusters are assigned based on the connected components.

In this paper, we present DBSCAN++, a step towards a fast and scalable DBSCAN. DBSCAN++ is based on the observation that we only need to compute the density estimates for a subset m of the n data points, where m can be much smaller than n, in order to cluster properly. To choose these m points, we provide two simple strategies: uniform and greedy K-center-based sampling. The resulting procedure has O(mn) worst-case runtime.

We show that with this modification, we still maintain statistical consistency guarantees. We show the trade-off between computational cost and estimation rates. Interestingly, up to a certain point, we can enjoy the same minimax-optimal estimation rates attained by DBSCAN while making m (instead of the larger n) empirical density queries, thus leading to a sub-quadratic procedure. In some cases, we saw that our method of limiting the number of core points can act as a regularization, thus reducing the sensitivity of classical DBSCAN to its parameters.

We show on both simulated datasets and real datasets that DBSCAN++ runs in a fraction of the time compared to DBSCAN, while giving competitive performance and consistently producing more robust clustering scores across hyperparameter settings.
Figure 1. Runtime (seconds) vs dataset size to cluster a mixture of four 3-dimensional Gaussians. Using Gaussian mixtures, we see that DBSCAN starts to show quadratic behavior as the dataset gets large. After 10^6 points, DBSCAN ran too slowly and was terminated after 3 hours. This is with only 3 dimensions.

2. Related Works

There has been much work done on finding faster variants of DBSCAN. We can only highlight some of these works here. One approach is to speed up the nearest neighbor queries that DBSCAN uses (Huang and Bian, 2009; Vijayalaksmi and Punithavalli, 2012; Kumar and Reddy, 2016). Another approach is to find a set of "leader" points that still preserve the structure of the original data set and then identify clusters based on the clustering of these "leader" points (Geng et al., 2000; Viswanath and Pinkesh, 2006; Viswanath and Babu, 2009). Our approach of finding core points is similar but is simpler and comes with theoretical guarantees. Liu (2006) modified DBSCAN by selecting clustering seeds among the unlabeled core points in an orderly manner in order to reduce computation time in regions that have already been clustered. Other heuristics include (Borah and Bhattacharyya, 2004; Zhou et al., 2000b; Patwary et al., 2012; Kryszkiewicz and Lasek, 2010).

There are also numerous approaches based on parallel computing such as (Xu et al., 1999; Zhou et al., 2000a; Arlia and Coppola, 2001; Brecheisen et al., 2006; Chen et al., 2010; Patwary et al., 2012; Götz et al., 2015), including map-reduce based approaches (Fu et al., 2011; He et al., 2011; Dai and Lin, 2012; Noticewala and Vaghela, 2014). Then there are distributed approaches to DBSCAN where data is partitioned across different locations and there may be communication cost constraints (Januzaj et al., 2004a;b; Liu et al., 2012; Neto et al., 2015; Lulli et al., 2016). It is also worth mentioning Andrade et al. (2013), who presented a GPU implementation of DBSCAN that can be over 100x faster than sequential DBSCAN. In this paper, we assume a single processor, although extending our approach to the parallel or distributed settings could be a future research direction.

We now discuss the theoretical work done for DBSCAN. Despite the practical significance of DBSCAN, its statistical properties have only been explored recently (Sriperumbudur and Steinwart, 2012; Jiang, 2017a; Wang et al., 2017; Steinwart et al., 2017). Such analyses make use of recent developments in topological data analysis to show that DBSCAN estimates the connected components of a level-set of the underlying density.

It turns out there has been a long history in estimating the level-sets of the density function (Hartigan, 1975; Tsybakov et al., 1997; Singh et al., 2009; Rigollet et al., 2009; Rinaldo and Wasserman, 2010; Chaudhuri and Dasgupta, 2010; Steinwart, 2011; Balakrishnan et al., 2013; Chaudhuri et al., 2014; Jiang, 2017b; Chen et al., 2017). However, most of these methods have little practical value (some are unimplementable), and DBSCAN is one of the only practical methods that is able to attain the strongest guarantees, including finite-sample Hausdorff minimax optimal rates. In fact, the only previous method that was shown to attain such guarantees was the impractical histogram-based method of Singh et al. (2009), until Jiang (2017a) showed that DBSCAN attained almost identical guarantees. This paper shows that DBSCAN++ can attain similar guarantees while being sub-quadratic in computational complexity, as well as the precise trade-off in estimation rates for further computational speedup.

3. Algorithm

We have n i.i.d. samples X = {x1, ..., xn} drawn from a distribution F over R^D. We now define core-points, which are essentially points with high empirical density defined with respect to the two hyperparameters of DBSCAN, minPts and ε. The latter is also known as the bandwidth.

Definition 1. Let ε > 0 and minPts be a positive integer. Then x ∈ X is a core-point if |B(x, ε) ∩ X| ≥ minPts, where B(x, ε) := {x' : |x − x'| ≤ ε}.

In other words, a core-point is a sample point that has at least minPts sample points within its ε-radius neighborhood.

DBSCAN (Ester et al., 1996) is shown as Algorithm 1, which is in a more concise but equivalent form to the original version (see Jiang (2017a)). It creates a graph G with core-points as vertices and edges connecting core points that are at distance at most ε apart. The final clusters are represented by the connected components in this graph, along with non-core-points that are close to such a connected component. The remaining points are designated as noise points and are left unclustered. Noise points can be seen as outliers.
Algorithm 1 DBSCAN
  Inputs: X, ε, minPts
  C ← core-points in X given ε and minPts
  G ← initialize empty graph
  for c ∈ C do
    Add an edge (and possibly a vertex or vertices) in G from c to all points in X ∩ B(c, ε)
  end for
  return connected components of G.

Algorithm 2 DBSCAN++
  Inputs: X, m, ε, minPts
  S ← sample m points from X.
  C ← all core-points in S w.r.t. X, ε, and minPts
  G ← empty graph.
  for c ∈ C do
    Add an edge (and possibly a vertex or vertices) in G from c to all points in X ∩ B(c, ε)
  end for
  return connected components of G.
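To make the procedure concrete, below is a minimal Python sketch of Algorithm 2 with uniform sampling. It is illustrative only and not the implementation used for the experiments: the function name dbscan_pp, the use of scipy's cKDTree for the range queries, and the sparse connected-components step are assumptions of this sketch. Points whose component contains no core-point are reported as noise, matching the treatment of noise points described above.

    # Minimal sketch of Algorithm 2 (DBSCAN++ with uniform sampling).
    # X is an (n, D) numpy array; names are illustrative.
    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def dbscan_pp(X, m, eps, min_pts, rng=None):
        rng = np.random.default_rng(rng)
        n = len(X)
        tree = cKDTree(X)

        # Sample m points; keep those with >= min_pts neighbors in X as core points.
        S = rng.choice(n, size=m, replace=False)
        counts = [len(nb) for nb in tree.query_ball_point(X[S], r=eps)]
        core = S[np.array(counts) >= min_pts]

        # Connect each core point to all points within distance eps, then take
        # connected components of the resulting (undirected) graph.
        rows, cols = [], []
        for c, nb in zip(core, tree.query_ball_point(X[core], r=eps)):
            rows += [c] * len(nb)
            cols += list(nb)
        graph = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
        _, comp = connected_components(graph, directed=False)

        # Components containing no core point are noise (-1); cluster ids are
        # the (arbitrary) component ids of the remaining components.
        is_core_comp = np.zeros(comp.max() + 1, dtype=bool)
        is_core_comp[comp[core]] = True
        return np.where(is_core_comp[comp], comp, -1)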
Figure 2. Core-points from a mixture of three 2D Gaussians. Each point marked with a triangle represents a core-point and the shaded area its ε-neighborhood. The total ε-radii area of DBSCAN++ core-points provides adequate coverage of the dataset. The K-center initialization produces an even more efficient covering. The points that are not covered will be designated as outliers. This illustrates that a strategically selected subset of core points can produce a reasonable ε-neighborhood cover for clustering.

3.1. Uniform Initialization

DBSCAN++, shown in Algorithm 2, proceeds as follows: First, it chooses a subset S of m uniformly sampled points from the dataset X. Then, it computes the empirical density of points in S w.r.t. the entire dataset. That is, a point x ∈ S is a core point if |B(x, ε) ∩ X| ≥ minPts. From here, DBSCAN++ builds a similar neighborhood graph G of core-points in S and finds the connected components in G. Finally, it clusters the rest of the unlabeled points to their closest core-points. Thus, since it only recovers a fraction of the core-points, it requires expensive density estimation queries on only m of the n samples. However, the intuition, as shown in Figure 2, is that a smaller sample of core-points can still provide adequate coverage of the dataset and lead to a reasonable clustering.

3.2. K-Center Initialization

Instead of uniformly choosing the subset of m points at random, we can use K-center (Gonzalez, 1985; Har-Peled, 2011), which aims at finding the subset of size m that minimizes the maximum distance of any point in X to its closest point in that subset. In other words, it tries to find the most efficient covering of the sample points. We use the greedy initialization method for approximating K-center (Algorithm 3), which repeatedly picks the farthest point from any point currently in the set. This process continues until we have a total of m points. This method gives a 2-approximation to the K-center problem.

Algorithm 3 Greedy K-center Initialization
  Input: X, m.
  S ← {x1}.
  for i from 1 to m − 1 do
    Add argmax_{x∈X} min_{s∈S} |x − s| to S.
  end for
  return S.
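A small sketch of the greedy initialization of Algorithm 3, written to match the description above (the function name and array layout are illustrative assumptions). Maintaining a running minimum distance to the chosen set is what gives the O(mn) cost quoted in Section 3.3.

    import numpy as np

    def greedy_k_center(X, m):
        """Greedy K-center initialization (Algorithm 3): returns m indices of X."""
        chosen = [0]                                   # start from x_1
        dist = np.linalg.norm(X - X[0], axis=1)        # distance of each point to the chosen set
        for _ in range(m - 1):
            far = int(np.argmax(dist))                 # farthest point from the current set
            chosen.append(far)
            dist = np.minimum(dist, np.linalg.norm(X - X[far], axis=1))
        return np.array(chosen)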
3.3. Time Complexity

DBSCAN++ has a time complexity of O(nm). Choosing the set S takes linear time for the uniform initialization method and O(mn) for the greedy K-center approach (Gonzalez, 1985). The next step is to find the core-points. We use a KD-Tree to query for the points within the ε-radius ball of each point in S. Each such query takes O(n) in the worst case, and doing so for the m sampled points leads to a cost of O(nm). Constructing the graph takes O(mn) time, and running a depth-first search on the graph recovers the connected components in O(nm) since the graph will have at most O(nm) edges.

The last step is to cluster the remaining points to the nearest core point. We once again use a KD-Tree, which takes O(m) for each of the O(n) points, leading to a time complexity of O(nm) as well. Thus, the time complexity of DBSCAN++ is O(nm).

4. Theoretical Analysis

In this section, we show that DBSCAN++ is a consistent estimator of the density level-sets. It was recently shown by Jiang (2017a) that DBSCAN does this with finite-sample guarantees. We extend this analysis to show that our modified DBSCAN++ also has statistical consistency guarantees, and we show the trade-off between speed and convergence rate.
Definition 2 (Level-set). The λ-level-set of f is defined as L_f(λ) := {x ∈ X : f(x) ≥ λ}.

Our results are under the setting that the density level λ is known, and they give insight into how to tune the parameters based on the desired density level.

4.1. Regularity Assumptions

We have n i.i.d. samples X = {x1, ..., xn} drawn from a distribution F over R^D. We take f to be the density of F over the uniform measure on R^D.

Assumption 1. f is continuous and has compact support X ⊆ R^D.

Much of the results will depend on the behavior of level-set boundaries. Thus, we require sufficient drop-off at the boundaries as well as separation between the connected components at a particular level-set.

Define the following shorthands for the distance from a point to a set and the neighborhood around a set.

Definition 3. d(x, A) := inf_{x'∈A} |x − x'|, and B(C, r) := {x ∈ X : d(x, C) ≤ r}.

Assumption 2 (β-regularity of level-sets). Let 0 < β < ∞. There exist Č, Ĉ, r_c > 0 such that the following holds for all x ∈ B(L_f(λ), r_c) \ L_f(λ):

    Č · d(x, L_f(λ))^β ≤ λ − f(x) ≤ Ĉ · d(x, L_f(λ))^β.

Remark 1. We can choose any 0 < β < ∞. The β-regularity condition is a standard assumption in level-set analyses; see Singh et al. (2009). The higher the β, the more smooth the density is around the boundary of the level-set and thus the less salient it is. This makes it more difficult to recover the level-set.

4.2. Hyperparameter Settings

In this section, we state the hyperparameter settings in terms of n, the sample size, and the desired density level λ in order for statistical consistency guarantees to hold. Define C_{δ,n} := 16 log(2/δ) · √(log n), where δ, 0 < δ < 1, is a confidence parameter which will be used later (i.e. guarantees will hold with probability at least 1 − δ). Set

    ε = ( minPts / ( n · v_D · (λ − λ · C_{δ,n} · √(2/minPts)) ) )^{1/D},

where v_D is the volume of the unit ball in R^D and minPts satisfies

    C_l · (log n)^2 ≤ minPts ≤ C_u · (log n)^{2D/(2+D)} · n^{2β/(2β+D)},

where C_l and C_u are positive constants depending on δ and f.
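For intuition, ignoring the lower-order C_{δ,n} correction, the setting above chooses ε so that an ε-ball at density λ contains about minPts points in expectation. A small sketch of that simplified computation (this is a simplification for intuition only, with illustrative names; the exact constant appears in the display above):

    import numpy as np
    from scipy.special import gamma

    def unit_ball_volume(D):
        return np.pi ** (D / 2) / gamma(D / 2 + 1)          # v_D

    def eps_rule_of_thumb(n, D, lam, min_pts):
        # Solve n * v_D * eps^D * lam ~= minPts for eps (C_{delta,n} correction dropped).
        return (min_pts / (n * unit_ball_volume(D) * lam)) ** (1.0 / D)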
4.3. Level-set estimation result

We give the estimation rate under the Hausdorff metric.

Definition 4 (Hausdorff Distance). d_Haus(A, A') := max{ sup_{x∈A} d(x, A'), sup_{x'∈A'} d(x', A) }.

Theorem 1. Suppose Assumptions 1 and 2 hold, and assume the parameter settings in the previous section. There exist C_l, C sufficiently large and C_u sufficiently small such that the following holds with probability at least 1 − δ. Let L̂_f(λ) be the core-points returned by Algorithm 2 under uniform initialization or greedy K-center initialization. Then,

    d_Haus(L̂_f(λ), L_f(λ)) ≤ C · ( C_{δ,n}^{2/β} · minPts^{−1/(2β)} + C_{δ,n}^{1/D} · (√(log m) / m)^{1/D} ).
ticular level-set. Proof. There are two quantities to bound: (i)
maxx∈L\ d(x, Lf (λ)), which ensures that the esti-
Define the following shorthands for distance from a point to f (λ)
a set and the neighborhood around a set. mated core-points are not far from the true core-points (i.e.
\
Lf (λ)), and (ii) supx∈Lf (λ) d(x, L f (λ)), which ensures
Definition 3. d(x, A) := inf x0 ∈A |x − x0 |, B(C, r) :=
{x ∈ X : d(x, C) ≤ r}, that the estimates core-points provides a good covering of
the level-set.
Assumption 2 (β-regularity of level-sets). Let 0 < β < ∞.
There exist Č, Ĉ, rc > 0 such that the following holds for The bound for (i) follows by the main result of Jiang
all x ∈ B(Lf (λ), rc )\Lf (λ). (2017a). This is because DBSCAN++’s estimated
core-points are a subset of that of the original DB-
Č · d(x, Lf (λ))β ≤ λ − f (x) ≤ Ĉ · d(x, Lf (λ))β . SCAN procedure. Thus, maxx∈L\ d(x, Lf (λ)) ≤
(λ) f
Remark 1. We can choose any 0 < β < ∞. The β- maxx∈L^ ^
d(x, Lf (λ)), where L f (λ) are the core-points
f (λ)
regularity condition is a standard assumption in level-set returned by original DBSCAN; this quantity is bounded by
analyses. See (Singh et al., 2009). The higher the β, the 2/β
O(Cδ,n · minPts−1/2β ) by Jiang (2017a).
more smooth the density is around the boundary of the level-
set and thus the less salient it is. This makes it more difficult We now turn to the other direction and bound
to recover the level-set. \
supx∈Lf (λ) d(x, Lf (λ)). Let x ∈ Lf (λ).
Suppose we use the uniform initialization. Define r0 :=
4.2. Hyperparameter Settings √
2Cδ,n D log m
1/D
mvD ·λ . Then, we have
In this section, we state the hyperparameter settings in terms
of n, the sample size, and the desired density level λ in order
Z
for statistical√consistency guarantees to hold. Define Cδ,n = f (z) · 1[z ∈ B(x, r0 )]dz ≥ vD r0 D (λ − Ĉr0β )
X
16 log(2/δ) log n, where δ, 0 < δ < 1, is a confidence √
D Cδ,n D log m
parameter which will be used later (i.e. guarantees will hold ≥ vD r0 λ/2 = ,
with probability at least 1 − δ). m
!1/D where the first inequality holds from Assumption 2, the
minPts second inequality holds for n sufficiently large, and the last
ε= √
2 / minPts)
, holds from the conditions on minPts.
n · vD · (λ − λ · Cδ,n
By the uniform ball convergence rates of Lemma 7 of Chaud-
where vD is the volume of the unit ball in RD and minPts huri and Dasgupta (2010), we have that with high probabil-
satisfies ity, there exists sample point x0 ∈ S such that |x − x0 | ≤ r0 .
2D
Cl · (log n)2 ≤ minPts ≤ Cu · (log n) 2+D · n2β/(2β+D) , This is because the ball B(x, r0 ) contains sufficiently high
true mass to be guaranteed a sample point in S. Moreover,
and Cl and Cu are positive constants depending on δ, f . this guarantee holds with high probability uniformly over
Next, we show that x' is a core-point. This follows by Lemma 8 of Jiang (2017a), which shows that any sample point x ∈ L_f(λ) satisfies |B(x, ε) ∩ X| ≥ minPts. Thus, x' ∈ L̂_f(λ). Hence, sup_{x∈L_f(λ)} d(x, L̂_f(λ)) ≤ r_0, as desired.

Now suppose we use the greedy K-center initialization. Define the following attained K-center objective:

    τ := max_{x∈X} min_{s∈S} d(s, x),

and the optimal K-center objective:

    τ_opt := min_{S'⊆X, |S'|=m} max_{x∈X} min_{s∈S'} d(s, x).

It is known that the greedy K-center initialization is a 2-approximation (see Gonzalez (1985); Har-Peled (2011)), thus

    τ ≤ 2 · τ_opt ≤ 2 · r_0,

where the last inequality follows with high probability since the K-center objective will be sub-optimal if we sampled the m centers uniformly. Then, we have

    sup_{x∈L_f(λ)} min_{s∈S} d(s, x) ≤ max_{x∈X} min_{s∈S} d(s, x) + d_Haus(L_f(λ), X ∩ L_f(λ)) ≤ τ + r_0 ≤ 3 · r_0.

The argument then proceeds in the same way as with uniform initialization but with an extra constant factor, as desired.

Remark 2. When taking minPts to the maximum allowed rate,

    minPts ≈ n^{2β/(2β+D)},

we obtain the error rate (ignoring log factors) of

    d_Haus(L̂_f(λ), L_f(λ)) ≲ n^{−1/(2β+D)} + m^{−1/D}.

The first term matches the known lower bound for level-set estimation established in Theorem 4 of Tsybakov et al. (1997). The second term is the trade-off for computing the empirical densities for only m of the points. In particular, if we take

    m ≳ n^{D/(2β+D)},

then the first term dominates, and we thus have d_Haus(L̂_f(λ), L_f(λ)) ≲ n^{−1/(2β+D)}, the minimax optimal rate for level-set estimation. This leads to the following result.

Corollary 1. Suppose the conditions of Theorem 1 hold and set m ≈ n^{D/(2β+D)}. Then, Algorithm 2 is a minimax optimal estimator (up to logarithmic factors) of the density level-set with sub-quadratic runtime of O(n^{2−2β/(2β+D)}).
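For an illustrative instance of Corollary 1: with β = 1 and D = 2, the prescription m ≈ n^{D/(2β+D)} becomes m ≈ n^{1/2}, which attains the n^{−1/4} minimax rate (up to logarithmic factors) with runtime O(nm) = O(n^{3/2}), rather than the worst-case O(n^2) of DBSCAN.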
Figure 3. Summary of datasets used. Includes dataset size (n), number of features (D), number of clusters (c), and the subsample size (m) used by both DBSCAN++ uniform and K-center.

                        n     D    c     m
(A) iris              150     4    3     3
(B) wine              178    13    3     5
(C) spam             1401    57    2   793
(D) images            210    19    7    24
(E) MNIST           60000    20   10   958
(F) Libras            360    90   15    84
(G) mobile           2000    20    4   112
(H) zoo               101    16    7     8
(I) seeds             210    19    7     6
(J) letters         20000    16   26   551
(K) phonemes         4509   256    5   396
(L) fashion MNIST   60000   784   10  5674
(M) celeb-a         10000    40    3  3511

4.4. Estimating the connected components

The previous section shows that the core-points returned by DBSCAN++ recover the density level-set. The more interesting question is about the actual clustering: that is, whether DBSCAN++ can recover the connected components of the density level-set individually and whether there is a 1:1 correspondence between the clusters returned by DBSCAN++ and the connected components.

It turns out that to obtain such a result, we need a minor modification of the procedure. That is, after determining the core points, instead of using the ε cutoff to connect points into the same cluster, we must use a higher cutoff. In fact, any constant value would do as long as it is sufficiently smaller than the pairwise distances between the connected components. For example, for the original DBSCAN algorithm, many analyses must make this same modification. This is known as pruning false clusters in the literature (see Kpotufe and von Luxburg (2011); Jiang (2017a)). The same analysis carries over to our modification, and we omit it here. We note that pruning does not change the final estimation rates but may change the initial sample size required.

4.5. Outlier detection

One important application of DBSCAN is outlier detection (Breunig et al., 2000; Çelik et al., 2011; Thang and Kim, 2011). Datapoints not assigned to clusters are noise points and can be considered outliers. This is because the noise points are the low-density points away from the clusters and thus tend to be points with few similar representatives in the dataset. We show that the noise points DBSCAN++ finds are similar to the noise points discovered by DBSCAN. We give a simple result that shows that every DBSCAN noise point is also one DBSCAN++ finds (Lemma 1). Then,
Figure 4 (Left) shows that the number of noise points of
DBSCAN++ quickly converges to those of DBSCAN as the
ratio m/n increases, which combined with Lemma 1, shows
that the noise points DBSCAN++ returns closely approximate those returned by DBSCAN when m/n is sufficiently high.
Lemma 1 (Noise points). For any dataset, if N0 and N1
are the noise points found by DBSCAN and DBSCAN++
respectively, then as long as they have the same setting of ε
and minPts, we have that N0 ⊆ N1.
Proof. Noise points are those that are further than ε dis-
tance away from a core point. The result follows since DB-
SCAN++ core points are a subset of that of DBSCAN.
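Lemma 1 is easy to verify on any pair of clustering outputs. Assuming the common convention that noise points are labeled −1 (as in the sketch of Algorithm 2 in Section 3; the function name below is illustrative), a check of N0 ⊆ N1 looks as follows:

    import numpy as np

    def noise_is_subset(labels_dbscan, labels_dbscan_pp):
        """True iff every DBSCAN noise point (label -1) is also DBSCAN++ noise."""
        n0 = set(np.flatnonzero(np.asarray(labels_dbscan) == -1))
        n1 = set(np.flatnonzero(np.asarray(labels_dbscan_pp) == -1))
        return n0.issubset(n1)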
5. Experiments
5.1. Dataset and setup
We ran DBSCAN++ with uniform and K-center initializa-
tions and compared both to DBSCAN on 11 real datasets as
described in Figure 3. We used Phonemes (Friedman et al.,
2001), a dataset of log periodograms of spoken phonemes,
and MNIST, a sub-sample of the MNIST handwriting recog-
nition dataset after running a PCA down to 20 dimensions.
The rest of the datasets we used are standard UCI or Kaggle
datasets used for clustering. We evaluate the performance
via two widely-used clustering scores: the adjusted RAND
index (Hubert and Arabie, 1985) and adjusted mutual in-
formation score (Vinh et al., 2010), which are computed
against the ground truth. We fixed minPts = 10 for all
procedures throughout experiments.
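Both scores are available, for example, in scikit-learn; the toy label arrays below are placeholders for the ground truth and the predicted clusters (noise points may be left as −1, which is simply treated as its own group):

    from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

    labels_true = [0, 0, 1, 1, 2, 2]       # ground-truth classes (toy example)
    labels_pred = [1, 1, 0, 0, -1, -1]     # predicted clusters; -1 marks noise

    ari = adjusted_rand_score(labels_true, labels_pred)
    ami = adjusted_mutual_info_score(labels_true, labels_pred)
    print(ari, ami)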
5.2. Trade-off between accuracy and speed
The theoretical results suggest that up to a certain point, only
computing empirical densities for a subset of the sample
points should not noticeably impact the clustering perfor-
mance. Past that point, we begin to see a trade-off. We
confirm this empirically in Figure 4 (Right), which shows
that indeed past a certain threshold of m/n, the cluster-
ing scores remain high. Only when the sub-sample is too
small do we begin seeing a significant trade-off in clustering scores. This shows that DBSCAN++ can save considerable computational cost while maintaining similar clustering performance as DBSCAN.

We further demonstrate this point by applying these procedures to image segmentation, where segmentation is done by clustering the image's pixels (with each pixel represented as a 5-dimensional vector consisting of (x, y) position and RGB color). Figure 5 shows that DBSCAN++ provides a very similar segmentation as DBSCAN in a fraction of the time. While this is just a simple qualitative example, it serves to show the applicability of DBSCAN++ to a potentially wide range of problems.

Figure 4. Each row corresponds to a dataset. See Figure 3 for dataset descriptions. Left (Outlier Detection): The number of outliers (i.e. noise points) returned by DBSCAN against m/n, compared to that of DBSCAN++. We see that the set of DBSCAN++ outliers quickly approaches that of DBSCAN, thus showing that DBSCAN++ remains effective at outlier detection compared to DBSCAN, especially when m/n is sufficiently high. Right (Clustering Performance): We plot the clustering accuracy and runtimes for eight real datasets as a function of the ratio m/n. As expected, the runtime increases approximately linearly in this ratio, but the clustering scores consistently attain high values when m/n is sufficiently large. Interestingly, sometimes we attain higher scores with lower values of m/n, thus giving both better runtime and accuracy.
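A minimal sketch of the pixel-feature construction used in this segmentation setup (the helper name and array layout are illustrative; Figure 5 itself was produced with the full pipeline): each pixel becomes a 5-dimensional vector of (x, y) position and RGB color, which can then be fed to any of the clustering procedures above.

    import numpy as np

    def pixel_features(image):
        """image: (H, W, 3) RGB array -> (H*W, 5) rows of (x, y, R, G, B)."""
        H, W, _ = image.shape
        ys, xs = np.mgrid[0:H, 0:W]
        return np.column_stack([xs.ravel(), ys.ravel(),
                                image.reshape(-1, 3)]).astype(float)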
Figure 5. Figure skater Yuzuru Hanyu at the 2018 Olympics. DBSCAN was run with hyperparameters ε = 8 and minPts = 10,
and DBSCAN++ with ε = 60, m/n = 0.3, and minPts = 10.
DBSCAN++ with K-centers initialization recovers similar clusters
in the 988 × 750 image as DBSCAN in far less time: 7.38 seconds
versus 44.18 seconds. The speedup becomes more significant on
higher resolution images.
5.3. Robustness to Hyperparameters
In Figure 6, we plot each algorithm’s performance across
a wide range of its hyperparameters. The table in Figure 7
shows the best scores and runtimes for each dataset and algo-
rithm. For these experiments, we chose m = p · n^{D/(D+4)},
where 0 < p < 1 was chosen based on validation over just 3
values, as explained in Figure 7. We found that the K-center
initialization required smaller p due to its ability to find a
good covering of the space and more efficiently choose the
sample points to query for the empirical density.
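For example, for MNIST with n = 60000 and D = 20, choosing p = 0.1 gives m = 0.1 · 60000^{20/24} ≈ 959, consistent with the m = 958 reported in Figure 3.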
The results in Figure 6 show that DBSCAN++ with uniform initialization gives competitive performance compared to DBSCAN but with robustness across a much wider range of ε. In fact, in a number of cases, DBSCAN++ was even better than DBSCAN under optimal tuning. DBSCAN++ with K-center initialization further improves on the clustering results of DBSCAN++ for most of the datasets. Pruning the core-points, as DBSCAN++ does, may act as a regularizing factor that reduces the algorithm's dependence on the precise setting of its parameters.
An explanation of why DBSCAN++ adds robustness across ε follows. When tuning DBSCAN with respect to ε, we found that DBSCAN often performed optimally on only a narrow range of ε. Because ε controls both the designation of points as core-points as well as the connectivity of the core-points, small changes could produce significantly different clusterings.

Figure 6. Clustering performance over a range of hyperparameter settings. Experimental results on datasets described in Figure 3. Each row corresponds to a single dataset and each column corresponds to a clustering score. For each dataset and clustering score, we plot the scores for DBSCAN++ with uniform and K-center sampling vs DBSCAN across a wide range of settings for ε (x-axis).
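The robustness experiments amount to sweeping ε and recording a clustering score at each setting. A schematic of such a sweep is below; scikit-learn's DBSCAN is used here purely as a stand-in to show the loop structure, and the same sweep applies to the DBSCAN++ sketch of Section 3:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import adjusted_rand_score

    def sweep_eps(X, labels_true, eps_grid, min_pts=10):
        """One clustering score per bandwidth setting, for robustness plots."""
        scores = []
        for eps in eps_grid:
            labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
            scores.append(adjusted_rand_score(labels_true, labels))
        return np.array(scores)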
In contrast, DBSCAN++ suffers less from the hyper-connectivity of the core-points until ε is very large. It turns out that only processing a subset of the core-points not only reduces the runtime of the algorithm, but it provides the practical benefit of reducing the tenuous connections between connected components that are actually far away. This way, DBSCAN++ is much less sensitive to changes in ε and reaches its saturation point (where there is only one cluster) only at very large ε.

Performance under optimal tuning is often not available in practice, and this is especially the case in unsupervised problems like clustering where the ground truth is not assumed to be known. Thus, not only should procedures produce accurate clusterings in the best setting, but it may be even more important for procedures to be precise, easy to tune, and reasonable across a wide range of their hyperparameter settings. This added robustness (not to mention speedup) may make DBSCAN++ a more practical method. This is especially true on large datasets where it may be costly to iterate over many hyperparameter settings.

         DBSCAN   Uniform          K-Center
(A)      0.5681   0.6163 (±0.01)   0.6634
         0.5768   0.6449 (±0.01)   0.7301
(B)      0.2851   0.3254 (±0.01)   0.3694
         0.3587   0.3605 (±0.00)   0.4148
(C)      0.2851   0.3254 (±0.01)   0.3694
         0.3587   0.3605 (±0.00)   0.4148
(D)      0.2922   0.2701 (±0.01)   0.3853
         0.4938   0.4289 (±0.01)   0.5600
(E)      0.0844   0.1097 (±0.00)   0.1416
         0.1743   0.3774 (±0.00)   0.3152
(F)      0.0939   0.1380 (±0.00)   0.2095
         0.2170   0.3033 (±0.00)   0.4461
(G)      0.0551   0.1741 (±0.00)   0.1091
         0.2123   0.2585 (±0.00)   0.2418
(H)      0.6846   0.6729 (±0.01)   0.7340
         0.6347   0.6356 (±0.00)   0.7456
(I)      0.4041   0.4991 (±0.02)   0.4402
         0.4403   0.4843 (±0.02)   0.5057
(J)      0.0623   0.0488 (±0.00)   0.0901
         0.3823   0.3956 (±0.00)   0.3841
(K)      0.5101   0.5541 (±0.01)   0.5364
         0.6475   0.6259 (±0.01)   0.6452

Figure 7. Highest scores for each dataset and algorithm. The first row is the adjusted RAND index and the second row the adjusted mutual information. The highest score of the row is in green while the second highest is in orange. The standard error over 10 runs is given in parentheses for DBSCAN++ with uniform initialization. Both other algorithms are deterministic. Each algorithm was tuned across a range of ε with minPts = 10. For both DBSCAN++ algorithms, we used p values of 0.1, 0.2, or 0.3. We see that DBSCAN++ uniform performs better than DBSCAN on 17 out of 22 metrics, while DBSCAN++ K-center performs better than DBSCAN on 21 out of 22 metrics.

         DBSCAN           Uniform          K-Center
(A)      3.07 (±0.08)     1.52 (±0.09)     2.55 (±0.34)
(B)      2.04 (±0.07)     1.31 (±0.07)     0.79 (±0.02)
(C)      3308 (±26.4)     225.86 (±6.8)    442.69 (±2.0)
(D)      4.88 (±0.09)     1.51 (±0.05)     1.32 (±0.04)
(E)      1.5e5 (±0.17)    3.5e3 (±39.23)   7.0e3 (±41.1)
(F)      37.63 (±0.38)    8.20 (±0.22)     9.84 (±0.06)
(G)      67.05 (±0.63)    11.41 (±0.21)    15.23 (±0.32)
(H)      1.07 (±0.03)     0.78 (±0.03)     0.81 (±0.03)
(I)      1.75 (±0.04)     0.91 (±0.03)     0.97 (±0.09)
(J)      1.0e5 (±76.43)   5.2e3 (±17.48)   1.5e3 (±36.4)
(K)      1.2e4 (±160)     1.9e3 (±32.45)   1.9e3 (±30.4)
(L)      3.9e9 (±4.3e4)   7.4e8 (±4.1e3)   3.6e8 (±307)
(M)      4.1e9 (±6.2e4)   3.1e8 (±411)     2.3e8 (±1.1e3)

Figure 8. Runtimes (milliseconds) and standard errors for each dataset and algorithm. DBSCAN++ using both uniform and K-center initializations performs reasonably well within a fraction of the runtime of DBSCAN. The larger the dataset, the less time DBSCAN++ requires compared to DBSCAN, showing that DBSCAN++ scales much better in practice.

5.4. Performance under optimal tuning

We see that under optimal tuning of each algorithm, DBSCAN++ consistently outperforms DBSCAN in both clustering scores and runtime. We see in Figure 7 that DBSCAN++ with the uniform initialization consistently outperforms DBSCAN, and DBSCAN++ with K-center initialization consistently outperforms both of the other algorithms. Figure 8 shows that indeed DBSCAN++ gives a speed advantage over DBSCAN for the runs that attained the optimal performance. These results thus suggest that not only is DBSCAN++ faster, it can achieve better clusterings.

6. Conclusion

In this paper, we presented DBSCAN++, a modified version of DBSCAN that only computes the density estimates for a subset of m of the n points in the original dataset. We established statistical consistency guarantees which show the trade-off between computational cost and estimation rates, and we prove that, interestingly, up to a certain point, we can enjoy the same estimation rates while reducing computation cost. We also demonstrate this finding empirically. We then showed empirically that not only can DBSCAN++ scale considerably better than DBSCAN, its performance is competitive in accuracy and consistently more robust across their mutual bandwidth hyperparameters. Such robustness can be highly desirable in practice where optimal tuning is costly or unavailable.
References

Guilherme Andrade, Gabriel Ramos, Daniel Madeira, Rafael Sachetto, Renato Ferreira, and Leonardo Rocha. G-dbscan: A gpu accelerated algorithm for density-based clustering. Procedia Computer Science, 18:369–378, 2013.
Domenica Arlia and Massimo Coppola. Experiments in parallel clustering with dbscan. In European Conference on Parallel Processing, pages 326–331. Springer, 2001.
Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. Cluster trees on manifolds. In Advances in Neural Information Processing Systems, pages 2679–2687, 2013.
Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd international conference on Machine learning, pages 97–104. ACM, 2006.
B Borah and DK Bhattacharyya. An improved sampling-based dbscan for large spatial databases. In Intelligent Sensing and Information Processing, 2004. Proceedings of International Conference on, pages 92–96. IEEE, 2004.
Stefan Brecheisen, Hans-Peter Kriegel, and Martin Pfeifle. Parallel density-based clustering of complex objects. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 179–188. Springer, 2006.
Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. Lof: identifying density-based local outliers. In ACM sigmod record, volume 29, pages 93–104. ACM, 2000.
Mete Çelik, Filiz Dadaşer-Çelik, and Ahmet Şakir Dokuz. Anomaly detection in temperature data using dbscan algorithm. In Innovations in Intelligent Systems and Applications (INISTA), 2011 International Symposium on, pages 91–95. IEEE, 2011.
Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pages 343–351, 2010.
Kamalika Chaudhuri, Sanjoy Dasgupta, Samory Kpotufe, and Ulrike von Luxburg. Consistent procedures for cluster tree estimation and pruning. IEEE Transactions on Information Theory, 60(12):7900–7912, 2014.
Min Chen, Xuedong Gao, and HuiFei Li. Parallel dbscan with priority r-tree. In Information Management and Engineering (ICIME), 2010 The 2nd IEEE International Conference on, pages 508–511. IEEE, 2010.
Yen-Chi Chen, Christopher R Genovese, and Larry Wasserman. Density level sets: Asymptotics, inference, and visualization. Journal of the American Statistical Association, pages 1–13, 2017.
Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE transactions on pattern analysis and machine intelligence, 17(8):790–799, 1995.
Bi-Ru Dai and I-Chang Lin. Efficient map/reduce-based dbscan algorithm with optimized data partition. In 2012 IEEE Fifth International Conference on Cloud Computing, pages 59–66. IEEE, 2012.
Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253–262. ACM, 2004.
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226–231, 1996.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
Yan Xiang Fu, Wei Zhong Zhao, and Hui Fang Ma. Research on parallel dbscan algorithm design based on mapreduce. In Advanced Materials Research, volume 301, pages 1133–1138. Trans Tech Publ, 2011.
Junhao Gan and Yufei Tao. Dbscan revisited: mis-claim, unfixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 519–530. ACM, 2015.
ZHOU Shui Geng, ZHOU Ao Ying, CAO Jing, and HU Yun Fa. A fast density based clustering algorithm [j]. Journal of Computer Research and Development, 11:001, 2000.
Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.
Markus Götz, Christian Bodenstein, and Morris Riedel. Hpdbscan: highly parallel dbscan. In Proceedings of the workshop on machine learning in high-performance computing environments, page 2. ACM, 2015.
Sariel Har-Peled. Geometric approximation algorithms. Number 173. American Mathematical Soc., 2011.
John A Hartigan. Clustering algorithms, volume 209. Wiley New York, 1975.
Yaobin He, Haoyu Tan, Wuman Luo, Huajian Mao, Di Ma, Shengzhong Feng, and Jianping Fan. Mr-dbscan: an efficient parallel density-based clustering algorithm using mapreduce. In Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on, pages 473–480. IEEE, 2011.
Ming Huang and Fuling Bian. A grid and density based fast spatial clustering algorithm. In Artificial Intelligence and Computational Intelligence, 2009. AICI'09. International Conference on, volume 4, pages 260–263. IEEE, 2009.
Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2(1):193–218, 1985.
Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613. ACM, 1998.
Eshref Januzaj, Hans-Peter Kriegel, and Martin Pfeifle. Dbdc: Density based distributed clustering. In International Conference on Extending Database Technology, pages 88–105. Springer, 2004a.
Eshref Januzaj, Hans-Peter Kriegel, and Martin Pfeifle. Scalable density-based distributed clustering. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 231–244. Springer, 2004b.
Heinrich Jiang. Density level set estimation on manifolds with dbscan. In International Conference on Machine Learning, pages 1684–1693, 2017a.
Heinrich Jiang. Uniform convergence rates for kernel density estimation. In International Conference on Machine Learning, pages 1694–1703, 2017b.
Samory Kpotufe and Ulrike von Luxburg. Pruning nearest neighbor cluster trees. arXiv preprint arXiv:1105.0540, 2011.
Marzena Kryszkiewicz and Piotr Lasek. Ti-dbscan: Clustering with dbscan by means of the triangle inequality. In International Conference on Rough Sets and Current Trends in Computing, pages 60–69. Springer, 2010.
K Mahesh Kumar and A Rama Mohan Reddy. A fast dbscan clustering algorithm by accelerating neighbor searching using groups method. Pattern Recognition, 58:39–48, 2016.
Bing Liu. A fast density-based clustering algorithm for large databases. In Machine Learning and Cybernetics, 2006 International Conference on, pages 996–1000. IEEE, 2006.
Jinfei Liu, Joshua Zhexue Huang, Jun Luo, and Li Xiong. Privacy preserving distributed dbscan clustering. In Proceedings of the 2012 Joint EDBT/ICDT Workshops, pages 177–185. ACM, 2012.
Alessandro Lulli, Matteo Dell'Amico, Pietro Michiardi, and Laura Ricci. Ng-dbscan: scalable density-based clustering for arbitrary data. Proceedings of the VLDB Endowment, 10(3):157–168, 2016.
Antonio Cavalcante Araujo Neto, Ticiana Linhares Coelho da Silva, Victor Aguiar Evangelista de Farias, José Antonio F Macêdo, and Javam de Castro Machado. G2p: A partitioning approach for processing dbscan with mapreduce. In International Symposium on Web and Wireless Geographical Information Systems, pages 191–202. Springer, 2015.
Maitry Noticewala and Dinesh Vaghela. Mr-idbscan: Efficient parallel incremental dbscan algorithm using mapreduce. International Journal of Computer Applications, 93(4), 2014.
Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, and Alok Choudhary. A new scalable parallel dbscan algorithm using the disjoint-set data structure. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 62. IEEE Computer Society Press, 2012.
Philippe Rigollet, Régis Vert, et al. Optimal rates for plug-in estimators of density level sets. Bernoulli, 15(4):1154–1178, 2009.
Alessandro Rinaldo and Larry Wasserman. Generalized density clustering. The Annals of Statistics, pages 2678–2722, 2010.
Aarti Singh, Clayton Scott, Robert Nowak, et al. Adaptive hausdorff estimation of density level sets. The Annals of Statistics, 37(5B):2760–2782, 2009.
Bharath Sriperumbudur and Ingo Steinwart. Consistency and rates for clustering with dbscan. In Artificial Intelligence and Statistics, pages 1090–1098, 2012.
Ingo Steinwart. Adaptive density level set clustering. In Proceedings of the 24th Annual Conference on Learning Theory, pages 703–738, 2011.
Ingo Steinwart, Bharath K Sriperumbudur, and Philipp Thomann. Adaptive clustering using kernel density estimators. arXiv preprint arXiv:1708.05254, 2017.
Tran Manh Thang and Juntae Kim. The anomaly detection by using dbscan clustering with multiple parameters. In Information Science and Applications (ICISA), 2011 International Conference on, pages 1–5. IEEE, 2011.
Alexandre B Tsybakov et al. On nonparametric estimation
of density level sets. The Annals of Statistics, 25(3):
948–969, 1997.
S Vijayalaksmi and M Punithavalli. A fast approach to
clustering datasets using dbscan and pruning algorithms.
International Journal of Computer Applications, 60(14),
2012.
Nguyen Xuan Vinh, Julien Epps, and James Bailey. Informa-
tion theoretic measures for clusterings comparison: Vari-
ants, properties, normalization and correction for chance.
Journal of Machine Learning Research, 11(Oct):2837–
2854, 2010.
P Viswanath and V Suresh Babu. Rough-dbscan: A fast
hybrid density based clustering method for large data sets.
Pattern Recognition Letters, 30(16):1477–1488, 2009.
P Viswanath and Rajwala Pinkesh. l-dbscan: A fast hybrid
density based clustering method. In Pattern Recogni-
tion, 2006. ICPR 2006. 18th International Conference on,
volume 1, pages 912–915. IEEE, 2006.
Daren Wang, Xinyang Lu, and Alessandro Rinaldo. Opti-
mal rates for cluster tree estimation using kernel density
estimators. arXiv preprint arXiv:1706.03113, 2017.
Xiaowei Xu, Jochen Jäger, and Hans-Peter Kriegel. A fast
parallel clustering algorithm for large spatial databases.
In High Performance Data Mining, pages 263–290.
Springer, 1999.
Aoying Zhou, Shuigeng Zhou, Jing Cao, Ye Fan, and Yunfa
Hu. Approaches for scaling dbscan algorithm to large
spatial databases. Journal of computer science and tech-
nology, 15(6):509–526, 2000a.
Shuigeng Zhou, Aoying Zhou, Wen Jin, Ye Fan, and Wein-
ing Qian. Fdbscan: a fast dbscan algorithm. Ruan Jian
Xue Bao, 11(6):735–744, 2000b.
Appendix
A. Comparison to other methods
In this section, we compare DBSCAN++ against a variant of DBSCAN in which the nearest-neighbor search is replaced with an approximate nearest neighbor method from the FLANN library (https://www.cs.ubc.ca/research/flann/); we call this variant ANN DBSCAN.
Figure 9. DBSCAN using approximate nearest neighbors vs. DBSCAN++ vs. DBSCAN. Experimental results on a synthetic dataset
of 10,000 points drawn from five 50-dimensional uniform distributions run on DBSCAN++, DBSCAN, and DBSCAN using a fast
approximate nearest neighbors algorithm from the FLANN library. DBSCAN++ was run with K-center initialization and m/n = 0.1.
All algorithms were run with minPts = 10. ANN DBSCAN shows a comparable speedup to DBSCAN++ but poorer performance
compared to both DBSCAN and DBSCAN++, whereas DBSCAN++ shows both comparable performance to DBSCAN and comparable
runtime to ANN DBSCAN.