[MRG+1] optimize DBSCAN by rewriting in Cython by larsmans · Pull Request #4157 · scikit-learn/scikit-learn · GitHub

Closed
larsmans wants to merge 1 commit into scikit-learn:master from larsmans:dbscan-faster

Conversation

larsmans
Member

Follow-up to #4151: a Cython/C++ version of DBSCAN. ~30% off the running time for a set of 3.7e5 3-d points. Ball tree radius queries are now the bottleneck at >72% of the time; the inner loop/DFS algorithm takes <12%.

I have some ideas about optimizing the radius queries, but I'll leave them out of this PR.
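For readers not familiar with the approach, here is a rough Python-level sketch of the label-expansion inner loop that the Cython/C++ code implements; names and details are illustrative, not copied from the .pyx file:

import numpy as np

def dbscan_inner_sketch(is_core, neighborhoods, labels):
    """Expand clusters from core samples; labels start out as all -1."""
    label_num = 0
    stack = []
    for i in range(labels.shape[0]):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Depth-first search from core point i; non-core neighbours are
        # labelled as border points but never expanded further.
        while True:
            if labels[i] == -1:
                labels[i] = label_num
                if is_core[i]:
                    for v in neighborhoods[i]:
                        if labels[v] == -1:
                            stack.append(v)
            if not stack:
                break
            i = stack.pop()
        label_num += 1
    return labels

# Toy usage: points 0-2 form one cluster, point 3 is noise.
is_core = np.array([True, True, False, False])
neighborhoods = [np.array([0, 1, 2]), np.array([0, 1, 2]),
                 np.array([0, 1, 2]), np.array([3])]
labels = np.full(4, -1, dtype=np.intp)
print(dbscan_inner_sketch(is_core, neighborhoods, labels))  # [ 0  0  0 -1]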


cimport cython
from libcpp.vector cimport vector
cimport numpy as np
Member Author

Annoyingly, cimport numpy increases the source file size by thousands of lines. Using memoryviews is even worse.

@larsmans larsmans force-pushed the dbscan-faster branch 2 times, most recently from 0e7d52e to d4da410 on January 25, 2015 at 11:53
@coveralls

Coverage Status

Coverage decreased (-0.0%) to 94.78% when pulling d4da410 on larsmans:dbscan-faster into de92ada on scikit-learn:master.

@@ -111,7 +114,14 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
    neighbors_model.fit(X)
    neighborhoods = neighbors_model.radius_neighbors(X, eps,
                                                     return_distance=False)
    neighborhoods = np.array(neighborhoods)

    # Ugly hack to make dtype=object, ndim=1 array when all neighborhoods
Member

Hopefully this will be fixed in the neighbors module soon. But what's wrong with

tmp = np.empty(neighborhoods.shape[0], dtype='O')
tmp[:] = neighborhoods.tolist()
neighborhoods = tmp

It's a little bit clearer, perhaps?

@jnothman
Member

This looks functionally correct; I think readability could be improved for users who are not especially familiar with DBSCAN. The basic routine is to find connected components among the core points while labelling the non-core points.

On the topic of which, how does this compare to just using scipy.sparse.csgraph.connected_components over the radius_neighbors_graph output with core samples selected? Its disadvantage is that it's less flexible for un-batched computation of neighborhoods, and that it looks less like Wikipedia, but it might produce similar speed gains without those thousands of lines.

@larsmans
Member Author

Interesting idea, but I don't yet see how to add the border points to the components. After puzzling for some time, I decided to profile without handling the border points:

import numpy as np
from scipy.sparse.csgraph import connected_components

from sklearn.neighbors import NearestNeighbors
from sklearn.utils import check_array, check_consistent_length


def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
           algorithm='auto', leaf_size=30, p=2, sample_weight=None,
           random_state=None):
    if not eps > 0.0:
        raise ValueError("eps must be positive.")

    X = check_array(X, accept_sparse='csr')
    if sample_weight is not None:
        sample_weight = np.asarray(sample_weight)
        check_consistent_length(X, sample_weight)

    neighbors_model = NearestNeighbors(radius=eps, algorithm=algorithm,
                                       leaf_size=leaf_size,
                                       metric=metric, p=p)
    neighbors_model.fit(X)
    G = neighbors_model.radius_neighbors_graph(X, eps, mode='connectivity')

    is_core = np.asarray(G.sum(axis=0)).ravel() > min_samples
    core_samples = np.where(is_core)[0]

    core_labels = connected_components(G[core_samples][:, core_samples],
                                       directed=False)[1]

    labels = -np.ones(X.shape[0], dtype=np.intp)
    labels[is_core] = core_labels

    # Add the border points: assign non-core points to some core point
    # within an eps radius, if any.
    #noise_to_core = G[np.where(~is_core)][:, core_samples]

    return core_samples, labels

Alas, this is >30% slower than the C++ version (which may still benefit from the graph representation). I also don't see how to keep the randomization in this version.

@amueller
Member

thanks for trying :)

@jnothman
Member

There has been a suggestion to get rid of the randomization, on the basis that the algorithm is mostly deterministic: areas of density below the core-sample threshold, but within reach of core samples from more than one cluster, should be few.

Regarding the non-core points, the equivalent of the original would be something like:

Gprime = G[:, core_ordered_by_component]
non_core_labels = core_labels[Gprime.indices[Gprime.indptr[non_core]]]

Obviously there is some cost in that too...
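For context, the Gprime.indices[Gprime.indptr[non_core]] expression above relies on the CSR layout: row i's stored column indices live in indices[indptr[i]:indptr[i+1]], so indices[indptr[i]] is the first stored neighbour of row i, provided that row is non-empty. A toy illustration (not code from the PR):

import numpy as np
from scipy.sparse import csr_matrix

# 3x3 connectivity matrix; row 1 has two neighbours, rows 0 and 2 have one each.
Gprime = csr_matrix(np.array([[0, 1, 0],
                              [1, 0, 1],
                              [0, 0, 1]]))
rows = np.array([0, 2])
first_neighbor = Gprime.indices[Gprime.indptr[rows]]   # array([1, 2])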

When you say >30% slower, does that mean roughly the same as master? I wonder what the slow part is (and note you can avoid the sum with np.diff(G.indptr)). Does that relative time exclude the cost of radius_neighbors?
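The np.diff(G.indptr) hint works because, for a binary CSR connectivity matrix, the number of stored entries in a row equals that row's sum (and the radius graph is symmetric, so row and column counts coincide). A toy check, not code from the PR:

import numpy as np
from scipy.sparse import csr_matrix

G = csr_matrix(np.array([[1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 1]]))
# Stored entries per row, without building the dense result of G.sum().
counts = np.diff(G.indptr)                                   # array([2, 3, 2])
assert (counts == np.asarray(G.sum(axis=1)).ravel()).all()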


@larsmans
Member Author

As fast as master, indeed, and the relative time includes radius_neighbors, so the C++ loop wins big time if we optimize the ball trees. After separating the CSR matrix slicing from the actual CC algorithm, here's the line-by-line profile:

Line #      Hits         Time   Per Hit   % Time  Line Contents
   109         1       268484  268484.0      3.9      neighbors_model.fit(X)
   110         1      3626855 3626855.0     52.9      G = neighbors_model.radius_neighbors_graph(X, eps, mode='connectivity')
   111
   112         1       125303  125303.0      1.8      is_core = np.asarray(G.sum(axis=0)).ravel() > min_samples
   113         1         1212    1212.0      0.0      core_samples = np.where(is_core)[0]
   114
   115         1      1932676 1932676.0     28.2      Gslice = G[core_samples][:, core_samples]
   116         1       901430  901430.0     13.1      core_labels = connected_components(Gslice, directed=False)[1]

Putting in a .tocsc() makes things worse.

@jnothman
Member

Yes, I suspected the slicing was the slow part. Thanks for trying!


~30% off the running time for a 3.7e5 set of 3-d points.
@larsmans
Member Author
larsmans commented Feb 5, 2015

@jnothman I just pushed a new version with your workaround for the jagged arrays and a more extensive comment to explain what algorithm we're using. (It's no longer as "close to the Wikipedia" as my original was; that was before I realized DBSCAN is a connected components variant.)

@coveralls

Coverage Status

Coverage decreased (-0.01%) to 94.99% when pulling 43d552e on larsmans:dbscan-faster into 1d08f08 on scikit-learn:master.

@amueller
Member
amueller commented Mar 9, 2015

This is currently the fastest of the implementations, right? Any downsides? Otherwise I'd say rebase on master and I'll review?

@larsmans
Member Author
larsmans commented Mar 9, 2015

The downside is that it's in Cython/C++, and it needs a workaround (see comment) to get the exception handling right.

@amueller
Member
amueller commented Mar 9, 2015

I feel fine with it being Cython. Seems reasonable if it's significantly faster, and it's not that much code.

@GaelVaroquaux
Member
GaelVaroquaux commented Mar 9, 2015 via email

@larsmans
Member Author
larsmans commented Mar 9, 2015

We could, but I'd rather have the k-NN computations run in parallel. Those are the really expensive part.

@jnothman
Member

We could, but I'd rather have the k-NN computations run in parallel. Those are the really expensive part.

I think we are ready to merge parallelism for k-NN computations (#4009), but we appear not to gain as much as hoped from GIL release in radius-neighbors. I think DBSCAN should accept a precomputed output from radius_neighbors_graph so that the user can control the neighborhood calculation without the memory blowout of a dense precomputed matrix (although in such a case eps is ignored). Otherwise, we should accept an n_jobs parameter to DBSCAN to pass to the local radius_neighbors calculation once that parameter becomes available.
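To illustrate the first suggestion, the proposed workflow might look roughly like the sketch below. Note that DBSCAN accepting a sparse precomputed neighborhood graph via metric='precomputed' is an assumption about a possible API here, not something this PR implements:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)
eps = 0.1

# Compute the sparse eps-neighborhood graph once, under the user's control
# (this is the step that could be parallelized or batched).
nn = NearestNeighbors(radius=eps).fit(X)
D = nn.radius_neighbors_graph(X, mode='distance')  # only pairs within eps are stored

# Hypothetical: hand the precomputed sparse distances to DBSCAN.
db = DBSCAN(eps=eps, min_samples=5, metric='precomputed').fit(D)
labels = db.labels_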

@ogrisel
Member
ogrisel commented Mar 13, 2015

I made a benchmark script. While this is significantly faster than master, it also yields a different clustering: the number of clusters is quite different (although the quality of the clustering is comparable based on ARI and NMI):

on this branch:

Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
Duration: 12.559s
Found 287 clusters
NMI: 0.8924
ARI: 0.6817

on master:

Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
Duration: 46.335s
Found 347 clusters
NMI: 0.8920
ARI: 0.6815

Here is the script I used:

import numpy as np
from time import time
from sklearn.datasets import make_blobs
from sklearn.utils import shuffle
from sklearn.cluster import DBSCAN
from sklearn.metrics.cluster import normalized_mutual_info_score
from sklearn.metrics.cluster import adjusted_rand_score


# Make some noise (and signal)
rng = np.random.RandomState(42)
n_samples = int(5e5)
n_signal_samples = int(0.9 * n_samples)
n_noise_samples = n_samples - n_signal_samples

signal, labels = make_blobs(n_samples=n_signal_samples, n_features=2,
                            cluster_std=.1, centers=500,
                            random_state=rng)
noise = rng.uniform(low=-15, high=15, size=(n_noise_samples, 2))
data = np.vstack([signal, noise])
labels = np.concatenate([labels, [-1] * noise.shape[0]])
data, labels = shuffle(data, labels, random_state=rng)

# Fit the model
dbscan = DBSCAN(eps=0.05, algorithm='kd_tree')
print("Fitting dbscan on %r-shaped blob data (with uniform noise)"
      % (data.shape,))
t0 = time()
dbscan_labels = dbscan.fit_predict(data)
print("Duration: %0.3fs" % (time() - t0))
print("Found %d clusters" % len(np.unique(dbscan_labels)))
print("NMI: %0.4f" % normalized_mutual_info_score(labels, dbscan_labels))
print("ARI: %0.4f" % adjusted_rand_score(labels, dbscan_labels))

@ogrisel
Member
ogrisel commented Mar 13, 2015

Actually I just realized that this PR still has the random permutation of the samples that has been removed from master. Let me try to rebase this branch.

@ogrisel
Member
ogrisel commented Mar 13, 2015

I have pushed the result of the rebase (to remove the random permutation) in ogrisel/rebased-pr-4157 but I still get the same number of clusters as previously on this branch:

Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
Duration: 12.315s
Found 287 clusters
NMI: 0.8924
ARI: 0.6817

@larsmans please feel free to force push the result of my rebase into your PR branch to get travis to run on it.

@kno10
Contributor
kno10 commented Mar 13, 2015

In #4066 we changed the minPts parameter to be in line with the literature. Can you run the experiment with min_samples set to 6?

The number of core points and clusters will have an effect on both implementations.

@ogrisel
Member
ogrisel commented Mar 13, 2015

Master has min_samples=5 by default. I rebased this PR on top of master so the parameters match.

Just to be sure I changed my script to always set min_samples=6 and print the parameter values:

  • master:
Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
DBSCAN(algorithm='kd_tree', eps=0.05, leaf_size=30, metric='euclidean',
    min_samples=6, p=None, random_state=None)
Duration: 48.528s
Found 287 clusters
NMI: 0.8924
ARI: 0.6817
  • rebased-pr-4157
Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
DBSCAN(algorithm='kd_tree', eps=0.05, leaf_size=30, metric='euclidean',
    min_samples=6, p=None, random_state=None)
Duration: 12.868s
Found 281 clusters
NMI: 0.8966
ARI: 0.6869

So there is a slight discrepancy, maybe in the ordering of the expansions. No big deal though. I will try to give it a deeper look.

@kno10
Contributor
kno10 commented Mar 13, 2015

The permutation should only affect border points, and thus never the number of clusters. Only individual border points can switch from one cluster to another (and never to noise).

@ogrisel
Member
ogrisel commented Mar 13, 2015

I found and fixed the bug (+ added a test) in my ogrisel/rebased-pr-4157 branch: 0edbff8

The problem was a strict inequality test on the size of the neighborhood to be considered a core sample. On my branch I now have the expected number of clusters (matching master):

Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
DBSCAN(algorithm='kd_tree', eps=0.05, leaf_size=30, metric='euclidean',
    min_samples=5, p=None, random_state=None)
Duration: 13.364s
Found 347 clusters
NMI: 0.8920
ARI: 0.6815
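Illustratively, the fix amounts to making the core-sample test inclusive (a sketch with assumed variable names, not the actual diff in 0edbff8):

import numpy as np

# A point is a core sample when its eps-neighborhood, which includes the
# point itself, contains at least min_samples points.
min_samples = 5
neighborhood_sizes = np.array([3, 5, 7])      # len(neighborhoods[i]) for each point
is_core = neighborhood_sizes >= min_samples   # the buggy version used a strict '>'
print(is_core)                                # [False  True  True]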

@ogrisel
Member
ogrisel commented Mar 13, 2015

+1 for merging the rebased + fixed version of this branch. On large 2D data the speed-up is impressive (3x vs master, which is already much faster than the version in 0.15.2).

If people agree I would also like to backport that to the 0.16.X branch for inclusion in the 0.16 release if the merge to master happens on time.

@ogrisel ogrisel changed the title [MRG] optimize DBSCAN by rewriting in Cython [MRG+1] optimize DBSCAN by rewriting in Cython Mar 13, 2015
@@ -111,7 +114,14 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
    neighbors_model.fit(X)
    neighborhoods = neighbors_model.radius_neighbors(X, eps,
                                                     return_distance=False)
    neighborhoods = np.array(neighborhoods)

    # Hack to make dtype=object, ndim=1 array when all neighborhoods
Member

This is not required on master any more, right?

Member

Good catch. All neighbors models now consistently return dtype=object arrays for radius queries on master. I removed this hack in my branch and improved the test I added to check algorithm in ['brute', 'kd_tree', 'ball_tree'].

@amueller
Member

Looks good to me apart from my comments.

@ogrisel
Member
ogrisel commented Mar 17, 2015

I opened #4401 to get travis to run the tests for all versions of Python / numpy / scipy.

@GaelVaroquaux
Member

I think that this PR should be closed, now that #4401 is merged.

@ogrisel
Member
ogrisel commented Mar 17, 2015

Closing this. I forgot to try and release the GIL. But this can be tackled in another PR.

@GaelVaroquaux
Member
GaelVaroquaux commented Mar 17, 2015 via email

@ogrisel
Member
ogrisel commented Mar 17, 2015

Actually, not closing. We need a what's new entry and a backport to the 0.16.X branch. Will probably do it by tomorrow if nobody does it first.

@ogrisel
Member
ogrisel commented Mar 17, 2015

I updated the what's new entry and did the backport to 0.16.X.

@ogrisel ogrisel closed this Mar 17, 2015
@ogrisel
Member
ogrisel commented Mar 17, 2015

Thanks everyone involved 🍻!

@larsmans larsmans deleted the dbscan-faster branch March 18, 2015 10:25
@sturlamolden
Contributor

@larsmans SciPy 0.16 will have parallel k-NN (Python threads and GIL release), but only for Euclidean and Minkowski distances. It is also optimized in a number of other ways. On my laptop with an AMD Phenom 2 it runs about 20x faster; on my laptop with an i7 it runs about 10x faster. Just pass n_jobs=-1 to scipy.spatial.cKDTree.query.
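For reference, the parallel query looks something like the sketch below (the n_jobs keyword is new in SciPy 0.16; newer SciPy versions rename it to workers):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
X = rng.rand(100000, 3)

tree = cKDTree(X)
# Query the 5 nearest neighbours of every point using all available cores.
dist, idx = tree.query(X, k=5, n_jobs=-1)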

@GaelVaroquaux
Member
GaelVaroquaux commented May 15, 2015 via email
