Faster vectorized version of DBSCAN · scikit-learn/scikit-learn@e48ade5 · GitHub
Commit e48ade5

Faster vectorized version of DBSCAN
This benchmarks 20%-40% faster for me than the current head, and ~90% faster than 0.15.2. It was also faster than using a precomputed distance matrix, even when the data set was small enough (I used 10k and 50k coordinates from Twitter).
1 parent bf00f08 commit e48ade5

File tree

1 file changed (+18, -9 lines)

sklearn/cluster/dbscan_.py

Lines changed: 18 additions & 9 deletions
@@ -5,6 +5,7 @@
 
 # Author: Robert Layton <robertlayton@gmail.com>
 #         Joel Nothman <joel.nothman@gmail.com>
+#         Erich Schubert <erich.schubert@gmail.com>
 #
 # License: BSD 3 clause
 
@@ -134,6 +135,9 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
     # label_num is the label given to the new cluster
     label_num = 0
 
+    # Candidates
+    ctmp = np.zeros(X.shape[0], dtype=np.int8)
+
     # Look at all samples and determine if they are core.
     # If they are then build a new cluster from them.
     for index in core_samples:
@@ -144,16 +148,21 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
         labels[index] = label_num
 
         # candidates for new core samples in the cluster.
-        candidates = [index]
+        candidates = neighborhoods[index]
+        candidates = candidates[labels.take(candidates) == -1]
+
         while len(candidates) > 0:
-            cand_neighbors = np.concatenate(np.take(neighborhoods, candidates,
-                                                    axis=0).tolist())
-            cand_neighbors = np.unique(cand_neighbors)
-            noise = cand_neighbors[labels.take(cand_neighbors) == -1]
-            labels[noise] = label_num
-            # A candidate is a core point in the current cluster that has
-            # not yet been used to expand the current cluster.
-            candidates = np.intersect1d(noise, core_samples, assume_unique=True)
+            # Label neighbors:
+            labels[candidates] = label_num
+            # Keep only core points:
+            candidates = np.intersect1d(candidates, core_samples, assume_unique=True)
+            if len(candidates) == 0: break
+            # Get neighbors of core points:
+            for idx in np.take(neighborhoods, candidates, axis=0):
+                ctmp[idx] = 1
+            # Update candidates for next round of cluster expansion.
+            ctmp[labels > -1] = 0
+            candidates = np.flatnonzero(ctmp)
         # Current cluster finished.
         # Next core point found will start a new cluster.
         label_num += 1
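The rewritten expansion loop can be exercised in isolation. The sketch below replays the commit's candidate-propagation logic on a hypothetical, hand-built neighborhood graph; the `neighborhoods`, `core_samples`, and `labels` arrays are stand-ins for what `dbscan()` precomputes from `eps` and `min_samples`, and the toy data is invented for illustration.

```python
import numpy as np

# Hypothetical toy neighborhood graph: points 0-3 form one dense group,
# point 4 only sees itself (noise). In the real function these index
# arrays come from a radius query with eps/min_samples.
neighborhoods = np.empty(5, dtype=object)
neighborhoods[0] = np.array([0, 1, 2])
neighborhoods[1] = np.array([0, 1, 2, 3])
neighborhoods[2] = np.array([0, 1, 2, 3])
neighborhoods[3] = np.array([1, 2, 3])
neighborhoods[4] = np.array([4])
core_samples = np.array([0, 1, 2, 3])  # points with enough neighbors

labels = np.full(5, -1, dtype=np.intp)  # -1 means unlabeled/noise
ctmp = np.zeros(5, dtype=np.int8)       # scratch mask for the next frontier
label_num = 0

for index in core_samples:
    if labels[index] != -1:
        continue
    labels[index] = label_num
    # Start from the unlabeled neighbors of the seed core point.
    candidates = neighborhoods[index]
    candidates = candidates[labels.take(candidates) == -1]
    while len(candidates) > 0:
        # Label the whole frontier at once:
        labels[candidates] = label_num
        # Only core points can extend the cluster further:
        candidates = np.intersect1d(candidates, core_samples,
                                    assume_unique=True)
        if len(candidates) == 0:
            break
        # Union of the core points' neighborhoods, via the scratch mask:
        for idx in np.take(neighborhoods, candidates, axis=0):
            ctmp[idx] = 1
        # Drop anything already labeled; the rest is the next frontier.
        ctmp[labels > -1] = 0
        candidates = np.flatnonzero(ctmp)
    label_num += 1

print(labels.tolist())  # -> [0, 0, 0, 0, -1]
```

Note that `ctmp` never needs explicit resetting between clusters: any stale bit belongs to a point that has been labeled, so the `ctmp[labels > -1] = 0` line clears it before the next `flatnonzero`.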

0 commit comments