
[MRG] optimize DBSCAN (~10% faster on my data) #4151


Merged · 1 commit merged into scikit-learn:master on Jan 23, 2015

Conversation

larsmans
Member

I've found that, on a set of 380k samples (3 features; point clouds), radius_neighbors takes 33% of the time, np.unique 32%, and intersect1d 15%. This speeds up intersect1d by a factor of two and takes away some minor overhead.

Also fixes the path of an example script.
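
For reference, a hedged sketch of the kind of intersect1d speedup involved (the actual change in this PR may differ; the arrays below are illustrative stand-ins for neighbor index arrays). np.intersect1d sorts and de-duplicates both inputs by default; assume_unique=True skips the de-duplication pass when the caller can guarantee uniqueness.

```python
import numpy as np

# Toy stand-ins for neighbor index arrays (already unique, as in DBSCAN)
a = np.array([1, 4, 7, 9, 12])
b = np.array([2, 4, 9, 11])

# Default behaviour: both inputs are sorted and de-duplicated first
slow = np.intersect1d(a, b)

# assume_unique=True skips the de-duplication pass, which cuts a large
# part of the work for big index arrays that are known to be unique
fast = np.intersect1d(a, b, assume_unique=True)

assert np.array_equal(slow, fast)
```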

@larsmans larsmans changed the title from "ENH: optimize DBSCAN (~10% faster on my data)" to "[MRG] optimize DBSCAN (~10% faster on my data)" on Jan 23, 2015
larsmans added a commit that referenced this pull request Jan 23, 2015
@larsmans larsmans merged commit 44c8083 into scikit-learn:master Jan 23, 2015
@larsmans larsmans deleted the dbscan-faster branch January 23, 2015 11:26
@GaelVaroquaux
Member

Thanks!

@larsmans
Member Author

Still not fast enough, though. The 380k dataset is one of the smaller ones...

@amueller
Member

How about Birch? Shouldn't that be faster?
Btw, most point cloud clustering methods search only in a spatial neighborhood ;)

@larsmans
Member Author

But that's what DBSCAN is supposed to do, isn't it? It does radius neighbor queries and expands the clusters from those. The radius queries are one of the two bottlenecks, the other being the sorting inside np.unique. Getting rid of the latter would require a mergesorted function, which is not available in NumPy but maybe not very hard to write in Cython.
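
To make the missing primitive concrete: a minimal pure-Python sketch of a linear-time merge of two sorted, unique index arrays (the hypothetical mergesorted mentioned above). np.union1d instead concatenates and re-sorts; a practical version of this would be written in Cython as suggested.

```python
import numpy as np

def merge_sorted_unique(a, b):
    """Linear-time union of two sorted, unique 1-D index arrays."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i]); i += 1
        elif a[i] > b[j]:
            out.append(b[j]); j += 1
        else:                       # equal: keep one copy
            out.append(a[i]); i += 1; j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return np.asarray(out)

print(merge_sorted_unique(np.array([1, 3, 5]), np.array([2, 3, 6])))  # [1 2 3 5 6]
```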

I already found out that it's nearly impossible to get radius queries to run faster by conventional means like recursion unrolling.

I haven't tried Birch since I don't know how it will behave. What I'm trying to do is actually segmentation by identifying the k biggest clusters. DBSCAN on simple (x,y,z) vectors works like magic, much better than any of the similarly simple options in PCL.
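
As an aside, a hedged sketch of the segmentation workflow being described here (eps, min_samples, and k are illustrative, and points_xyz is a placeholder for the real (x, y, z) data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

points_xyz = np.random.rand(5000, 3)      # placeholder for the real point cloud

labels = DBSCAN(eps=0.05, min_samples=10).fit_predict(points_xyz)

# Keep the k biggest clusters, ignoring noise (label -1)
k = 3
ids, counts = np.unique(labels[labels != -1], return_counts=True)
biggest = ids[np.argsort(counts)[::-1][:k]]
```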

@amueller
Member

Well, if you have something like point clouds from a 2D depth sensor like the Kinect, you already have the data in a grid structure and don't need to build a tree.

@GaelVaroquaux
Member

Well, if you have something like point clouds from a 2D depth sensor like the Kinect, you already have the data in a grid structure and don't need to build a tree.

You should be using our agglomerative clustering in these situations. It is very fast when imposing a connectivity constraint.
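
A minimal sketch of that suggestion, assuming a k-nearest-neighbors connectivity graph (the neighborhood size and cluster count are illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(1000, 3)               # stand-in for an (x, y, z) point cloud

# Restricting merges to a local spatial neighborhood is what keeps
# agglomerative clustering fast on data like this
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

labels = AgglomerativeClustering(n_clusters=5,
                                 connectivity=connectivity).fit_predict(X)
```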

@amueller
Member

Pretty sure it is not faster than algorithms that encode the grid structure directly ;) (which I contributed to scikit-image a while back)

@jnothman
Member

Thanks, @larsmans, I wasn't aware of that option.

Some of this new implementation may be reverted given the discussions in #4066 (ping @kno10), which seem to suggest that the lower memory cost of the 0.15 version may be worthwhile and more in keeping with the original algorithm. Otherwise, we could suggest that users for whom memory isn't a problem precompute the pairwise distance matrix (or perhaps just a sparse radius neighbors matrix) for speed. Can you please benchmark this version on your dataset against 0.15 with metric='precomputed'?
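
A minimal sketch of the two precomputation options being floated (eps is illustrative; whether DBSCAN accepts a sparse radius-neighbors matrix as precomputed input depends on the scikit-learn version):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import radius_neighbors_graph

X = np.random.rand(1000, 3)   # toy stand-in; a dense matrix for 380k points would need over a terabyte
eps = 0.1

# Option 1: full pairwise distance matrix, O(n^2) memory
D = pairwise_distances(X)
labels = DBSCAN(eps=eps, metric='precomputed').fit_predict(D)

# Option 2: sparse radius neighbors graph, much cheaper to store
G = radius_neighbors_graph(X, radius=eps, mode='distance')
```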

@larsmans
Member Author

Precompute (3.8e5)^2 distances? Are you kidding?

@jnothman
Member

I didn't register the numbers, it seems. And I guess we'd need to write it differently again if we accepted a precomputed radius_neighbors_graph as input, but that could be an option. Okay, thanks.

@kno10
Contributor
kno10 commented Jan 24, 2015

I don't know if you had a look at the changes I did in e48ade5 (in my patch-2 branch), which are a bit faster for me.
From my observations: if you can afford the memory, using a distance matrix is fastest.
On low-dimensional data with Euclidean distance, the kd-tree can work reasonably well. If you have enough memory to first compute the neighborhoods (as in HEAD) this can be quite a bit faster.
If you are low on memory and use a distance that won't work with the kd-tree, the old code may be faster...

@larsmans
Member Author

The old code is actually several orders of magnitude slower. Its only benefit is that it fits in memory. I tried cythonizing it, but that doesn't buy much. Maybe it can be improved with an LRU cache on the neighbors computations, but that will be tricky...

@jnothman
Member

The old code is actually several orders of magnitude slower

Including with precomputed neighbors, I presume?

Maybe it can be improved with an LRU cache on the neighbors computations,

I think it only visits each point once to calculate neighbors, so I'm not sure what we're caching.


@kno10
Contributor
kno10 commented Jan 24, 2015

DBSCAN would usually compute each distance twice; there is some potential in caching, but not that much. (You also only need to cache whether the distance was less than epsilon. But managing all this may quickly turn out to be more expensive than computing distances; at least for cheap distance functions like Euclidean)
I also found the old code to be a lot slower; but there might be room for optimization that doesn't precompute all neighborhoods at once. Maybe the code of the k-d-tree can also be optimized more, for single queries. It appears to do a lot of sanity checks every time?

@larsmans
Member Author

I think it only visits each point once to calculate neighbors, so I'm not sure what we're caching.

You're right.

I just managed to get a simple Cython/C++ version to run faster than the current code. It's almost exactly the pseudocode on Wikipedia, but with a few tiny optimizations.
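
Not the merged Cython code, but a rough Python transliteration of the Wikipedia pseudocode being referenced, assuming the eps-neighborhoods have already been computed in batch:

```python
import numpy as np

def dbscan_sketch(neighborhoods, min_samples):
    """Assign cluster labels given precomputed eps-neighborhoods.

    ``neighborhoods[i]`` is the array of indices within eps of point i
    (including i itself). Returns -1 for noise, otherwise a cluster id.
    """
    n = len(neighborhoods)
    labels = np.full(n, -1, dtype=np.intp)
    core = np.array([len(nb) >= min_samples for nb in neighborhoods])

    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Depth-first expansion from an unvisited core point
        stack = [i]
        while stack:
            j = stack.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster_id
            if core[j]:               # only core points propagate the cluster
                stack.extend(k for k in neighborhoods[j] if labels[k] == -1)
        cluster_id += 1
    return labels
```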

Maybe the code of the k-d-tree can also be optimized more, for single queries.

Just what I was thinking. If we can put a Cython interface on that, and include a fast path for single queries, that could be a big win.

Also, I noticed it has an unused fast path for only counting the number of neighbors within the epsilon radius. That can be run in batch to determine the core points, if we can hack the sample_weight support in there.
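
A hedged sketch of that batch core-point determination using the public neighbors API (eps and min_samples are illustrative; sample_weight handling and the tree's internal count-only path are omitted):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(1000, 3)
eps, min_samples = 0.1, 5

nn = NearestNeighbors(radius=eps).fit(X)
neighborhoods = nn.radius_neighbors(X, return_distance=False)

# A point is a core point if its eps-neighborhood (including itself)
# contains at least min_samples points
n_neighbors = np.array([len(nb) for nb in neighborhoods])
core_mask = n_neighbors >= min_samples
```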

@jnothman
Member

Keeping in mind, of course, that when LSHForest is up to scratch, we'd still want to be able to substitute approximate for exact radius neighbors...


@larsmans
Member Author

@jnothman In that case we also want #4157, because that changes the second phase of the algorithm to linear time. Using NumPy's setops involves sorting.

@jnothman
Member

Yes, okay. I hadn't looked at the Cython implementation; I'd presumed that "It's almost exactly the pseudocode on Wikipedia" meant you no longer calculated radius_neighbors in batch.

@larsmans
Member Author

Ah, no. But it is a vanilla depth-first search now, and we could plug in the neighbors query where it's currently fetching neighbors from an array.
