[MRG+1] optimize DBSCAN by rewriting in Cython #4157
Conversation
cimport cython
from libcpp.vector cimport vector
cimport numpy as np
Annoyingly, cimport numpy increases the source file size by thousands of lines. Using memoryviews is even worse.
Branch updated: 0e7d52e → d4da410 → 63f9e1d.
@@ -111,7 +114,14 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
    neighbors_model.fit(X)
    neighborhoods = neighbors_model.radius_neighbors(X, eps,
                                                     return_distance=False)
    neighborhoods = np.array(neighborhoods)

    # Ugly hack to make dtype=object, ndim=1 array when all neighborhoods
Hopefully this will be fixed in the neighbors module soon. But what's wrong with

tmp = np.empty(neighborhoods.shape[0], dtype='O')
tmp[:] = neighborhoods.tolist()
neighborhoods = tmp

It's a little bit clearer, perhaps?
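For context, a minimal self-contained sketch (toy data, not from the PR) of why the plain np.array call misbehaves when all neighborhoods happen to have the same length, and one shape-independent way to build the ndim=1, dtype=object array:

import numpy as np

# Toy neighborhoods of equal length: np.array() stacks them into a 2-D array
# instead of the 1-D object array of per-sample index arrays that dbscan()
# iterates over -- this is what the hack guards against.
equal = [np.array([0, 1]), np.array([1, 2]), np.array([0, 2])]
print(np.array(equal).shape)   # (3, 2), not (3,)

# A shape-independent way to get the ndim=1, dtype=object array.
neighborhoods = np.empty(len(equal), dtype=object)
for i, neighbors in enumerate(equal):
    neighborhoods[i] = neighbors
print(neighborhoods.shape)     # (3,)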
This looks functionally correct; I think readability could be improved for users who are not especially familiar with DBSCAN. The basic routine is to find connected components among the core points while labelling non-core points. On the topic of which, how does this compare to just using a connected-components routine on the neighborhood graph?
Interesting idea, but I don't yet see how to add the border points to the components. After puzzling for some time, I decided to profile without handling the border points:

# Imports added so this snippet is self-contained.
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors
from sklearn.utils import check_array, check_consistent_length


def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
           algorithm='auto', leaf_size=30, p=2, sample_weight=None,
           random_state=None):
    if not eps > 0.0:
        raise ValueError("eps must be positive.")

    X = check_array(X, accept_sparse='csr')
    if sample_weight is not None:
        sample_weight = np.asarray(sample_weight)
        check_consistent_length(X, sample_weight)

    neighbors_model = NearestNeighbors(radius=eps, algorithm=algorithm,
                                       leaf_size=leaf_size,
                                       metric=metric, p=p)
    neighbors_model.fit(X)
    G = neighbors_model.radius_neighbors_graph(X, eps, mode='connectivity')

    is_core = np.asarray(G.sum(axis=0)).ravel() > min_samples
    core_samples = np.where(is_core)[0]
    core_labels = connected_components(G[core_samples][:, core_samples],
                                       directed=False)[1]
    labels = -np.ones(X.shape[0], dtype=np.intp)
    labels[is_core] = core_labels

    # Add the border points: assign non-core points to some core point
    # within an eps radius, if any.
    #noise_to_core = G[np.where(~is_core)][:, core_samples]

    return core_samples, labels

Alas, this is >30% slower than the C++ version (which may still benefit from the graph representation). I also don't see how to keep the randomization in this version.
thanks for trying :)
There has been a suggestion to get rid of the randomization, on the basis that it should only affect the assignment of border points. Regarding the non-core points, the equivalent of the original would be

Gprime = G[:, core_ordered_by_component]

Obviously there is some cost in that too... When you say >30% slower, does that mean roughly the same as master? I wonder [...]
As fast as master, indeed, and the relative time includes [...]. Putting in a [...]
Yes, I suspected the slicing was the slow part. Thanks for trying!
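For illustration, here is a minimal sketch of the slicing-based border-point assignment discussed above. It is not code from the PR; the function name assign_border_points is hypothetical, and it assumes G, is_core, core_labels and labels as defined in the connected-components snippet earlier:

import numpy as np

def assign_border_points(G, is_core, core_labels, labels):
    # G: sparse eps-neighborhood connectivity graph (CSR); is_core: boolean
    # mask of core samples; core_labels: component label per core sample;
    # labels: per-sample labels with core points already filled in.
    core_idx = np.where(is_core)[0]
    border_idx = np.where(~is_core)[0]
    # Rows: border points; columns: core points within eps of them.
    B = G[border_idx][:, core_idx].tocsr()
    for row, b in enumerate(border_idx):
        neighbors = B.indices[B.indptr[row]:B.indptr[row + 1]]
        if len(neighbors) > 0:
            # Deterministically pick the first core neighbour; the original
            # randomized version depends on visit order instead.
            labels[b] = core_labels[neighbors[0]]
    return labels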
Branch updated: 63f9e1d → 43d552e.
@jnothman I just pushed a new version with your workaround for the jagged arrays and a more extensive comment to explain what algorithm we're using. (It's no longer as "close to the Wikipedia" as my original was; that was before I realized DBSCAN is a connected components variant.)
This is currently the fastest of the implementations, right? Any downsides? Otherwise I'd say rebase on master and I'll review?
The downside is that it's in Cython/C++, and it needs a workaround (see comment) to get the exception handling right.
I feel fine with it being Cython. Seems reasonable if it's significantly faster and it's not that much code.
I feel fine with it being Cython. Seems reasonable if it's significantly faster and it's not that much code.

It's also a method that is quite central for many people. Maybe we can even release the GIL?
We could, but I'd rather have the k-NN computations run in parallel. Those are the really expensive part.
I think we are ready to merge parallelism for k-NN computations (#4009), but we appear not to gain as much as hoped from GIL release in radius-neighbors. I think DBSCAN should accept a precomputed output from [...]
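As an aside, later scikit-learn releases did grow support for feeding DBSCAN a precomputed sparse neighborhood graph; a rough sketch of how that looks today (not what was available at the time of this comment):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import radius_neighbors_graph
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.3, random_state=0)
eps = 0.5

# Precompute the radius neighborhoods once as a sparse distance graph ...
D = radius_neighbors_graph(X, radius=eps, mode='distance')

# ... and hand it to DBSCAN, which then skips its own neighbor queries.
labels = DBSCAN(eps=eps, min_samples=5, metric='precomputed').fit_predict(D)
print(np.unique(labels))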
I made a benchmark script. While this is significantly faster than master, it also yields a different clustering: the number of clusters is quite different (although the quality of the clustering is comparable based on ARI and NMI).

On this branch: [...]

On master: [...]

Here is the script I used:

import numpy as np
from time import time
from sklearn.datasets import make_blobs
from sklearn.utils import shuffle
from sklearn.cluster import DBSCAN
from sklearn.metrics.cluster import normalized_mutual_info_score
from sklearn.metrics.cluster import adjusted_rand_score
# Make some noise (and signal)
rng = np.random.RandomState(42)
n_samples = int(5e5)
n_signal_samples = int(0.9 * n_samples)
n_noise_samples = n_samples - n_signal_samples
signal, labels = make_blobs(n_samples=n_signal_samples, n_features=2,
                            cluster_std=.1, centers=500,
                            random_state=rng)
noise = rng.uniform(low=-15, high=15, size=(n_noise_samples, 2))
data = np.vstack([signal, noise])
labels = np.concatenate([labels, [-1] * noise.shape[0]])
data, labels = shuffle(data, labels, random_state=rng)
# Fit the model
dbscan = DBSCAN(eps=0.05, algorithm='kd_tree')
print("Fitting dbscan on %r-shaped blob data (with uniform noise)"
% (data.shape,))
t0 = time()
dbscan_labels = dbscan.fit_predict(data)
print("Duration: %0.3fs" % (time() - t0))
print("Found %d clusters" % len(np.unique(dbscan_labels)))
print("NMI: %0.4f" % normalized_mutual_info_score(labels, dbscan_labels))
print("ARI: %0.4f" % adjusted_rand_score(labels, dbscan_labels)) |
Actually I just realized that this PR still has the random permutation of the samples that has been removed from master. Let me try to rebase this branch.
I have pushed the result of the rebase (to remove the random permutation) in ogrisel/rebased-pr-4157, but I still get the same number of clusters as previously on this branch: [...]
@larsmans please feel free to force-push the result of my rebase into your PR branch to get Travis to run on it.
We changed the minPts parameter to be in line with the literature in #4066. The resulting change in the number of core points and clusters will affect both implementations.
Master has min_samples=5 by default. I rebased this PR on top of master so the parameters match. Just to be sure, I changed my script to always set [...]

So there is a slight discrepancy, maybe in the ordering of the expansions. No big deal though. I will try to give it a deeper look.
The permutation should only affect border points, and thus never the number of clusters. Only individual points can switch from one cluster to another (never to noise, either).
I found and fixed the bug (and added a test) in my ogrisel/rebased-pr-4157 branch: 0edbff8. The problem was a strict inequality test on the size of the neighborhood to be considered a core sample. On my branch I now have the expected number of clusters (matching master): [...]
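To make the off-by-one concrete, a toy illustration (made-up neighborhoods, not the PR's Cython code): with scikit-learn's convention that min_samples counts the point itself, the core-sample test has to be non-strict.

import numpy as np

# Made-up neighborhoods (each includes the point itself) and min_samples.
neighborhoods = np.empty(3, dtype=object)
neighborhoods[0] = np.array([0, 1, 2])
neighborhoods[1] = np.array([1, 0])
neighborhoods[2] = np.array([2, 0])
min_samples = 3

n_neighbors = np.array([len(nb) for nb in neighborhoods])
print(n_neighbors > min_samples)    # strict test: no core samples (the bug)
print(n_neighbors >= min_samples)   # non-strict test: sample 0 is a core sample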
+1 for merging the rebased + fixed version of this branch. On large 2D data the speed-up is impressive (3x vs master, which is already much faster than the version in 0.15.2). If people agree, I would also like to backport this to the 0.16.X branch for inclusion in the 0.16 release, if the merge to master happens on time.
@@ -111,7 +114,14 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
    neighbors_model.fit(X)
    neighborhoods = neighbors_model.radius_neighbors(X, eps,
                                                     return_distance=False)
    neighborhoods = np.array(neighborhoods)

    # Hack to make dtype=object, ndim=1 array when all neighborhoods
This is not required on master any more, right?
Good catch. All neighbors models now consistently return dtype=object arrays for radius queries in master. I removed this hack in my branch and improved the test I added to check for algorithm in ['brute', 'kd_tree', 'ball_tree'].
Looks good to me apart from my comments.
I opened #4401 to get Travis to run the tests for all versions of Python / numpy / scipy.
I think this PR should be closed now that #4401 is merged.
Closing this. I forgot to try and release the GIL. But this can be tackled in another PR.
Closing this. I forgot to try and release the GIL. But this can be tackled in another PR.

Agreed. This is already a big win.
Actually, not closing. We need a what's new entry and a backport to the 0.16.X branch. Will probably do it by tomorrow if nobody does it first.
I updated the what's new entry and did the backport to 0.16.X.
Thanks everyone involved 🍻!
@larsmans SciPy 0.16 will have parallel k-NN (Python threads and GIL release), but only for Euclidean and Minkowski distances. It is also optimized in a number of other ways. On my laptop with an AMD Phenom 2 it runs about 20x faster; on my laptop with an i7 it runs about 10x faster. Just pass [...]
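Presumably this refers to the parallel query option of scipy.spatial.cKDTree; a minimal sketch under that assumption, using the modern workers argument (the parameter was called n_jobs in older SciPy):

import numpy as np
from scipy.spatial import cKDTree

X = np.random.RandomState(0).rand(100000, 3)
tree = cKDTree(X)

# Query the 5 nearest neighbours of every point using all CPU cores.
dist, idx = tree.query(X, k=5, workers=-1)
print(dist.shape, idx.shape)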
Thanks everyone involved 🍻!

Indeed. Hurray!
Follow-up to #4151: Cython/C++ version of DBSCAN. ~30% off the running time for a set of 3.7e5 3-d points. Ball tree radius queries are now the bottleneck at >72% of the time; the inner loop/DFS algorithm takes <12% of the time.
I have some ideas about optimizing the radius queries, but I'll leave them out of this PR.
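For readers who want to see the shape of that inner loop without reading Cython, here is a rough Python paraphrase of a stack-based label expansion over precomputed neighborhoods. It is a sketch of the general technique, not the PR's code, and the toy data at the end is made up:

import numpy as np

def dbscan_inner_py(is_core, neighborhoods, labels):
    # is_core: boolean mask of core samples; neighborhoods: object array of
    # neighbor-index arrays; labels: int array initialised to -1 (noise).
    label_num = 0
    stack = []
    for i in range(labels.shape[0]):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Depth-first expansion from core point i.
        while True:
            if labels[i] == -1:
                labels[i] = label_num
                if is_core[i]:
                    for v in neighborhoods[i]:
                        if labels[v] == -1:
                            stack.append(v)
            if not stack:
                break
            i = stack.pop()
        label_num += 1
    return labels

# Toy usage: two core points within eps of each other plus one noise point.
neighborhoods = np.empty(3, dtype=object)
neighborhoods[0] = np.array([0, 1])
neighborhoods[1] = np.array([0, 1])
neighborhoods[2] = np.array([2])
labels = -np.ones(3, dtype=np.intp)
print(dbscan_inner_py(np.array([True, True, False]), neighborhoods, labels))  # [0 0 -1]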