[MRG+1] optimize DBSCAN by rewriting in Cython by larsmans · Pull Request #4157 · scikit-learn/scikit-learn · GitHub

Closed
larsmans wants to merge 1 commit into scikit-learn:master from larsmans:dbscan-faster

Conversation

larsmans
Member

Follow-up to #4151: a Cython/C++ version of DBSCAN. ~30% off the running time for a set of 3.7e5 3-d points. Ball tree radius queries are now the bottleneck at >72% of the time; the inner loop/DFS algorithm takes <12%.

I have some ideas about optimizing the radius queries, but I'll leave them out of this PR.
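For readers not familiar with the approach, here is a rough Python-level sketch of the label-expansion inner loop that the Cython/C++ code implements; names and details are illustrative, not copied from the .pyx file:

import numpy as np

def dbscan_inner_sketch(is_core, neighborhoods, labels):
    """Expand clusters from core samples; labels start out as all -1."""
    label_num = 0
    stack = []
    for i in range(labels.shape[0]):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Depth-first search from core point i; non-core neighbours are
        # labelled as border points but never expanded further.
        while True:
            if labels[i] == -1:
                labels[i] = label_num
                if is_core[i]:
                    for v in neighborhoods[i]:
                        if labels[v] == -1:
                            stack.append(v)
            if not stack:
                break
            i = stack.pop()
        label_num += 1
    return labels

# Toy usage: points 0-2 form one cluster, point 3 is noise.
is_core = np.array([True, True, False, False])
neighborhoods = [np.array([0, 1, 2]), np.array([0, 1, 2]),
                 np.array([0, 1, 2]), np.array([3])]
labels = np.full(4, -1, dtype=np.intp)
print(dbscan_inner_sketch(is_core, neighborhoods, labels))  # [ 0  0  0 -1]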


cimport cython
from libcpp.vector cimport vector
cimport numpy as np
Member Author

Annoyingly, cimport numpy increases the source file size by thousands of lines. Using memoryviews is even worse.

@larsmans larsmans force-pushed the dbscan-faster branch 2 times, most recently from 0e7d52e to d4da410 on January 25, 2015 at 11:53
@coveralls

Coverage Status

Coverage decreased (-0.0%) to 94.78% when pulling d4da410 on larsmans:dbscan-faster into de92ada on scikit-learn:master.

@@ -111,7 +114,14 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
    neighbors_model.fit(X)
    neighborhoods = neighbors_model.radius_neighbors(X, eps,
                                                     return_distance=False)
    neighborhoods = np.array(neighborhoods)

    # Ugly hack to make dtype=object, ndim=1 array when all neighborhoods
Member

Hopefully this will be fixed in the neighbors module soon. But what's wrong with

tmp = np.empty(neighborhoods.shape[0], dtype='O')
tmp[:] = neighborhoods.tolist()
neighborhoods = tmp

It's a little bit clearer, perhaps?

@jnothman
Member

This looks functionally correct; I think readability could be improved for users who are not especially familiar with DBSCAN. The basic routine is to find connected components among the core points while labelling the non-core points.

On the topic of which, how does this compare to just using scipy.sparse.csgraph.connected_components over the radius_neighbors_graph output with core samples selected? Its disadvantage is that it's less flexible for un-batched computation of neighborhoods, and that it looks less like Wikipedia, but it might produce similar speed gains without those thousands of lines.

@larsmans
Member Author

Interesting idea, but I don't yet see how to add the border points to the components. After puzzling for some time, I decided to profile without handling the border points:

import numpy as np
from scipy.sparse.csgraph import connected_components

from sklearn.neighbors import NearestNeighbors
from sklearn.utils import check_array, check_consistent_length


def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
           algorithm='auto', leaf_size=30, p=2, sample_weight=None,
           random_state=None):
    if not eps > 0.0:
        raise ValueError("eps must be positive.")

    X = check_array(X, accept_sparse='csr')
    if sample_weight is not None:
        sample_weight = np.asarray(sample_weight)
        check_consistent_length(X, sample_weight)

    neighbors_model = NearestNeighbors(radius=eps, algorithm=algorithm,
                                       leaf_size=leaf_size,
                                       metric=metric, p=p)
    neighbors_model.fit(X)
    G = neighbors_model.radius_neighbors_graph(X, eps, mode='connectivity')

    is_core = np.asarray(G.sum(axis=0)).ravel() > min_samples
    core_samples = np.where(is_core)[0]

    core_labels = connected_components(G[core_samples][:, core_samples],
                                       directed=False)[1]

    labels = -np.ones(X.shape[0], dtype=np.intp)
    labels[is_core] = core_labels

    # Add the border points: assign non-core points to some core point
    # within an eps radius, if any.
    #noise_to_core = G[np.where(~is_core)][:, core_samples]

    return core_samples, labels

Alas, this is >30% slower than the C++ version (which may still benefit from the graph representation). I also don't see how to keep the randomization in this version.

@amueller
Member

thanks for trying :)

@jnothman
Member

There has been a suggestion to get rid of the randomization, on the basis that the algorithm is mostly deterministic: areas of density below the core-sample threshold, but within reach of core samples from more than one cluster, should be few.

Regarding the non-core points, the equivalent of the original would be something like:

Gprime = G[:, core_ordered_by_component]
non_core_labels = core_labels[Gprime.indices[Gprime.indptr[non_core]]]

Obviously there is some cost in that too...
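For context, the Gprime.indices[Gprime.indptr[non_core]] expression above relies on the CSR layout: row i's stored column indices live in indices[indptr[i]:indptr[i+1]], so indices[indptr[i]] is the first stored neighbour of row i, provided that row is non-empty. A toy illustration (not code from the PR):

import numpy as np
from scipy.sparse import csr_matrix

# 3x3 connectivity matrix; row 1 has two neighbours, rows 0 and 2 have one each.
Gprime = csr_matrix(np.array([[0, 1, 0],
                              [1, 0, 1],
                              [0, 0, 1]]))
rows = np.array([0, 2])
first_neighbor = Gprime.indices[Gprime.indptr[rows]]   # array([1, 2])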

When you say >30% slower, does that mean roughly the same as master? I wonder what the slow part is (and note you can avoid the sum with np.diff(G.indptr)). Does that relative time exclude the cost of radius_neighbors?
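The np.diff(G.indptr) hint works because, for a binary CSR connectivity matrix, the number of stored entries in a row equals that row's sum (and the radius graph is symmetric, so row and column counts coincide). A toy check, not code from the PR:

import numpy as np
from scipy.sparse import csr_matrix

G = csr_matrix(np.array([[1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 1]]))
# Stored entries per row, without building the dense result of G.sum().
counts = np.diff(G.indptr)                                   # array([2, 3, 2])
assert (counts == np.asarray(G.sum(axis=1)).ravel()).all()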


@larsmans
Member Author

As fast as master, indeed, and the relative time includes radius_neighbors, so the C++ loop wins big time if we optimize the ball trees. After separating the CSR matrix slicing from the actual CC algorithm, here's the line-by-line profile:

Line #      Hits         Time   Per Hit   % Time  Line Contents
   109         1       268484  268484.0      3.9      neighbors_model.fit(X)
   110         1      3626855 3626855.0     52.9      G = neighbors_model.radius_neighbors_graph(X, eps, mode='connectivity')
   111
   112         1       125303  125303.0      1.8      is_core = np.asarray(G.sum(axis=0)).ravel() > min_samples
   113         1         1212    1212.0      0.0      core_samples = np.where(is_core)[0]
   114
   115         1      1932676 1932676.0     28.2      Gslice = G[core_samples][:, core_samples]
   116         1       901430  901430.0     13.1      core_labels = connected_components(Gslice, directed=False)[1]

Putting in a .tocsc() makes things worse.

@jnothman
Member

Yes, I suspected the slicing was the slow part. Thanks for trying!


~30% off the running time for a 3.7e5 set of 3-d points.
@larsmans
Member Author
larsmans commented Feb 5, 2015

@jnothman I just pushed a new version with your workaround for the jagged arrays and a more extensive comment to explain what algorithm we're using. (It's no longer as "close to the Wikipedia" as my original was; that was before I realized DBSCAN is a connected components variant.)

@coveralls

Coverage Status

Coverage decreased (-0.01%) to 94.99% when pulling 43d552e on larsmans:dbscan-faster into 1d08f08 on scikit-learn:master.

@amueller
Member
amueller commented Mar 9, 2015

This is currently the fastest of the implementations, right? Any downsides? Otherwise I'd say rebase on master and I'll review?

@larsmans
Member Author
larsmans commented Mar 9, 2015

The downside is that it's in Cython/C++, and it needs a workaround (see comment) to get the exception handling right.

@amueller
Member
amueller commented Mar 9, 2015

I feel fine with it being Cython. Seems reasonable if it's significantly faster, and it's not that much code.

@GaelVaroquaux
Member
GaelVaroquaux commented Mar 9, 2015 via email

@larsmans
Member Author
larsmans commented Mar 9, 2015

We could, but I'd rather have the k-NN computations run in parallel. Those are the really expensive part.

@jnothman
Member

We could, but I'd rather have the k-NN computations run in parallel. Those are the really expensive part.

I think we are ready to merge parallelism for k-NN computations (#4009), but we appear not to gain as much as hoped from GIL release in radius-neighbors. I think DBSCAN should accept a precomputed output from radius_neighbors_graph so that the user can control the neighborhood calculation without the memory blowout of a dense precomputed matrix (although in such a case eps is ignored). Otherwise, we should accept an n_jobs parameter to DBSCAN to pass to the local radius_neighbors calculation once that parameter becomes available.
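To illustrate the first suggestion, the proposed workflow might look roughly like the sketch below. Note that DBSCAN accepting a sparse precomputed neighborhood graph via metric='precomputed' is an assumption about a possible API here, not something this PR implements:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)
eps = 0.1

# Compute the sparse eps-neighborhood graph once, under the user's control
# (this is the step that could be parallelized or batched).
nn = NearestNeighbors(radius=eps).fit(X)
D = nn.radius_neighbors_graph(X, mode='distance')  # only pairs within eps are stored

# Hypothetical: hand the precomputed sparse distances to DBSCAN.
db = DBSCAN(eps=eps, min_samples=5, metric='precomputed').fit(D)
labels = db.labels_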

@ogrisel
Member
ogrisel commented Mar 13, 2015

I made a benchmark script. While this is significantly faster than master, it also yields a different clustering: the number of clusters is quite different (although the quality of the clustering is comparable based on ARI and NMI):

on this branch:

Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
Duration: 12.559s
Found 287 clusters
NMI: 0.8924
ARI: 0.6817

on master:

Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
Duration: 46.335s
Found 347 clusters
NMI: 0.8920
ARI: 0.6815

Here is the script I used:

import numpy as np
from time import time
from sklearn.datasets import make_blobs
from sklearn.utils import shuffle
from sklearn.cluster import DBSCAN
from sklearn.metrics.cluster import normalized_mutual_info_score
from sklearn.metrics.cluster import adjusted_rand_score


# Make some noise (and signal)
rng = np.random.RandomState(42)
n_samples = int(5e5)
n_signal_samples = int(0.9 * n_samples)
n_noise_samples = n_samples - n_signal_samples

signal, labels = make_blobs(n_samples=n_signal_samples, n_features=2,
                            cluster_std=.1, centers=500,
                            random_state=rng)
noise = rng.uniform(low=-15, high=15, size=(n_noise_samples, 2))
data = np.vstack([signal, noise])
labels = np.concatenate([labels, [-1] * noise.shape[0]])
data, labels = shuffle(data, labels, random_state=rng)

# Fit the model
dbscan = DBSCAN(eps=0.05, algorithm='kd_tree')
print("Fitting dbscan on %r-shaped blob data (with uniform noise)"
      % (data.shape,))
t0 = time()
dbscan_labels = dbscan.fit_predict(data)
print("Duration: %0.3fs" % (time() - t0))
print("Found %d clusters" % len(np.unique(dbscan_labels)))
print("NMI: %0.4f" % normalized_mutual_info_score(labels, dbscan_labels))
print("ARI: %0.4f" % adjusted_rand_score(labels, dbscan_labels))

@ogrisel
Member
ogrisel commented Mar 13, 2015

Actually I just realized that this PR still has the random permutation of the samples that has been removed from master. Let me try to rebase this branch.

@ogrisel
Member
ogrisel commented Mar 13, 2015

I have pushed the result of the rebase (to remove the random permutation) in ogrisel/rebased-pr-4157 but I still get the same number of clusters as previously on this branch:

Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
Duration: 12.315s
Found 287 clusters
NMI: 0.8924
ARI: 0.6817

@larsmans please feel free to force push the result of my rebase into your PR branch to get travis to run on it.

@kno10
Contributor
kno10 commented Mar 13, 2015

In #4066 we changed the minPts parameter to be in line with the literature. Can you run the experiment with min_samples set to 6?

The number of core points and clusters will have an effect on both implementations.

@ogrisel
Member
ogrisel commented Mar 13, 2015

Master has min_samples=5 by default. I rebased this PR on top of master so the parameters match.

Just to be sure I changed my script to always set min_samples=6 and print the parameter values:

  • master:
Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
DBSCAN(algorithm='kd_tree', eps=0.05, leaf_size=30, metric='euclidean',
    min_samples=6, p=None, random_state=None)
Duration: 48.528s
Found 287 clusters
NMI: 0.8924
ARI: 0.6817
  • rebased-pr-4157
Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
DBSCAN(algorithm='kd_tree', eps=0.05, leaf_size=30, metric='euclidean',
    min_samples=6, p=None, random_state=None)
Duration: 12.868s
Found 281 clusters
NMI: 0.8966
ARI: 0.6869

So there is a slight discrepancy, maybe in the ordering of the expansions. No big deal though. I will try to give it a deeper look.

@kno10
Contributor
kno10 commented Mar 13, 2015

The permutation should only affect border points, and thus never the number of clusters. Only individual border points can switch from one cluster to another (and never to noise).

@ogrisel
Member
ogrisel commented Mar 13, 2015

I found and fixed the bug (+ added a test) in my ogrisel/rebased-pr-4157 branch: 0edbff8

The problem was a strict inequality test on the size of the neighborhood to be considered a core sample. On my branch I now have the expected number of clusters (matching master):

Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
DBSCAN(algorithm='kd_tree', eps=0.05, leaf_size=30, metric='euclidean',
    min_samples=5, p=None, random_state=None)
Duration: 13.364s
Found 347 clusters
NMI: 0.8920
ARI: 0.6815
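Illustratively, the fix amounts to making the core-sample test inclusive (a sketch with assumed variable names, not the actual diff in 0edbff8):

import numpy as np

# A point is a core sample when its eps-neighborhood, which includes the
# point itself, contains at least min_samples points.
min_samples = 5
neighborhood_sizes = np.array([3, 5, 7])      # len(neighborhoods[i]) for each point
is_core = neighborhood_sizes >= min_samples   # the buggy version used a strict '>'
print(is_core)                                # [False  True  True]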

@ogrisel
Member
ogrisel commented Mar 13, 2015

+1 for merging the rebased + fixed version of this branch. On large 2D data the speed-up is impressive (3x vs master, which is already much faster than the version in 0.15.2).

If people agree I would also like to backport that to the 0.16.X branch for inclusion in the 0.16 release if the merge to master happens on time.

@ogrisel ogrisel changed the title [MRG] optimize DBSCAN by rewriting in Cython [MRG+1] optimize DBSCAN by rewriting in Cython Mar 13, 2015
@@ -111,7 +114,14 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
    neighbors_model.fit(X)
    neighborhoods = neighbors_model.radius_neighbors(X, eps,
                                                     return_distance=False)
    neighborhoods = np.array(neighborhoods)

    # Hack to make dtype=object, ndim=1 array when all neighborhoods
Member

This is not required on master any more, right?

Member

Good catch. All neighbors models now consistently return dtype=object arrays for radius queries on master. I removed this hack in my branch and improved the test I added to check algorithm in ['brute', 'kd_tree', 'ball_tree'].

@amueller
Member

Looks good to me apart from my comments.

@ogrisel
Member
ogrisel commented Mar 17, 2015

I opened #4401 to get travis to run the tests for all versions of Python / numpy / scipy.

@GaelVaroquaux
Member

I think that this PR should be closed, now that #4401 is merged.

@ogrisel
Member
ogrisel commented Mar 17, 2015

Closing this. I forgot to try and release the GIL. But this can be tackled in another PR.

@GaelVaroquaux
Member
GaelVaroquaux commented Mar 17, 2015 via email

@ogrisel
Member
ogrisel commented Mar 17, 2015

Actually, not closing. We need a what's new entry and a backport to the 0.16.X branch. Will probably do it by tomorrow if nobody does it first.

@ogrisel
Member
ogrisel commented Mar 17, 2015

I updated the what's new entry and did the backport to 0.16.X.

@ogrisel ogrisel closed this Mar 17, 2015
@ogrisel
Member
ogrisel commented Mar 17, 2015

Thanks everyone involved 🍻!

@larsmans larsmans deleted the dbscan-faster branch March 18, 2015 10:25
@sturlamolden
Contributor

@larsmans SciPy 0.16 will have parallel k-NN (Python threads and GIL release), but only for Euclidean and Minkowski distances. It is also optimized in a number of other ways. On my laptop with an AMD Phenom 2 it runs about 20x faster; on my laptop with an i7 it runs about 10x faster. Just pass n_jobs=-1 to scipy.spatial.cKDTree.query.
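For reference, the parallel query looks something like the sketch below (the n_jobs keyword is new in SciPy 0.16; newer SciPy versions rename it to workers):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
X = rng.rand(100000, 3)

tree = cKDTree(X)
# Query the 5 nearest neighbours of every point using all available cores.
dist, idx = tree.query(X, k=5, n_jobs=-1)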

@GaelVaroquaux
Member
GaelVaroquaux commented May 15, 2015 via email
