8000 [MRG] Faster vectorization of DBSCAN (plain python) by kno10 · Pull Request #4334 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG] Faster vectorization of DBSCAN (plain python) #4334

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 23 additions & 10 deletions sklearn/cluster/dbscan_.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@

# Author: Robert Layton <robertlayton@gmail.com>
# Joel Nothman <joel.nothman@gmail.com>
# Erich Schubert <erich.schubert@gmail.com>
#
# License: BSD 3 clause

Expand Down Expand Up @@ -137,6 +138,10 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
# label_num is the label given to the new cluster
label_num = 0

# Temporary boolean mask to join the candidates of the next round
# Where each 1 is a candidate for the next iteration
candidates_mask = np.zeros(X.shape[0], dtype=np.int8)

# Look at all samples and determine if they are core.
# If they are then build a new cluster from them.
for index in core_samples:
Expand All @@ -147,18 +152,26 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
labels[index] = label_num

# candidates for new core samples in the cluster.
candidates = [index]
candidates = neighborhoods[index]
candidates = candidates[labels.take(candidates) == -1]

# For performance reasons with numpy, we perform bulk expansion
# of neighborhoods instead of expanding one point at a time
while len(candidates) > 0:
# The tolist() is needed for NumPy 1.6.
cand_neighbors = np.concatenate(np.take(neighborhoods, candidates,
axis=0).tolist())
cand_neighbors = np.unique(cand_neighbors)
noise = cand_neighbors[labels.take(cand_neighbors) == -1]
labels[noise] = label_num
# A candidate is a core point in the current cluster that has
# not yet been used to expand the current cluster.
candidates = np.intersect1d(noise, core_samples,
# Label neighbors:
labels[candidates] = label_num
# Keep only core points:
candidates = np.intersect1d(candidates, core_samples,
assume_unique=True)
if len(candidates) == 0:
break
# Get neighbors of core points (list-of-lists):
for idx in np.take(neighborhoods, candidates, axis=0):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not ctmp[np.take(neighbors, candidates, axis=0)] = 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed somewhere before. Does not work, because it is of type list(list(int)) or something like this, depending on the index used - candidates will usually be contain more than one object, so it's the neighbors of more than one, and not a proper way of indexing an array anymore. Or so, I'm not a numpy syntax guru.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed np.take returns list of integer arrays of variable lengths in that case so direct indexed assignment is not possible.

The following seems to work:

ctmp[np.concatenate(np.take(neighborhoods, candidates, axis=0))] = 1

but the additional memory allocation is probably not worth it.

# Join neighbors into candidates_mask:
candidates_mask[idx] = 1
# Unset all candidates that have already been labeled
candidates_mask[labels > -1] = 0
candidates = np.flatnonzero(candidates_mask)
# Current cluster finished.
# Next core point found will start a new cluster.
label_num += 1
Expand Down
0