
[MRG] optimize DBSCAN (~10% faster on my data) #4151


Merged · 1 commit merged into scikit-learn:master on Jan 23, 2015

Conversation

larsmans
Member

I've found that, on a set of 380k samples (3 features; point clouds), radius_neighbors takes 33% of the time, np.unique 32%, and intersect1d 15%. This speeds up intersect1d by a factor of two and takes away some minor overhead.

Also fixes the path of an example script.
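
For reference, a hedged sketch of the kind of intersect1d speedup involved (the actual change in this PR may differ; the arrays below are illustrative stand-ins for neighbor index arrays). np.intersect1d sorts and de-duplicates both inputs by default; assume_unique=True skips the de-duplication pass when the caller can guarantee uniqueness.

```python
import numpy as np

# Toy stand-ins for neighbor index arrays (already unique, as in DBSCAN)
a = np.array([1, 4, 7, 9, 12])
b = np.array([2, 4, 9, 11])

# Default behaviour: both inputs are sorted and de-duplicated first
slow = np.intersect1d(a, b)

# assume_unique=True skips the de-duplication pass, which cuts a large
# part of the work for big index arrays that are known to be unique
fast = np.intersect1d(a, b, assume_unique=True)

assert np.array_equal(slow, fast)
```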

@larsmans larsmans changed the title from "ENH: optimize DBSCAN (~10% faster on my data)" to "[MRG] optimize DBSCAN (~10% faster on my data)" on Jan 23, 2015
larsmans added a commit that referenced this pull request Jan 23, 2015
@larsmans larsmans merged commit 44c8083 into scikit-learn:master Jan 23, 2015
@larsmans larsmans deleted the dbscan-faster branch January 23, 2015 11:26
@GaelVaroquaux
Member

Thanks!

@larsmans
Member Author

Still not fast enough, though. The 380k dataset is one of the smaller ones...

@amueller
Member

How about Birch? Shouldn't that be faster?
Btw, most point cloud clustering methods search only in a spatial neighborhood ;)

@larsmans
Member Author

But that's what DBSCAN is supposed to do, isn't it? It does radius neighbor queries and expands the clusters from those. The radius queries are one of the two bottlenecks, the other being the sorting inside np.unique. Getting rid of the latter would require a mergesorted function, which is not available in NumPy but maybe not very hard to write in Cython.
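
To make the missing primitive concrete: a minimal pure-Python sketch of a linear-time merge of two sorted, unique index arrays (the hypothetical mergesorted mentioned above). np.union1d instead concatenates and re-sorts; a practical version of this would be written in Cython as suggested.

```python
import numpy as np

def merge_sorted_unique(a, b):
    """Linear-time union of two sorted, unique 1-D index arrays."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i]); i += 1
        elif a[i] > b[j]:
            out.append(b[j]); j += 1
        else:                       # equal: keep one copy
            out.append(a[i]); i += 1; j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return np.asarray(out)

print(merge_sorted_unique(np.array([1, 3, 5]), np.array([2, 3, 6])))  # [1 2 3 5 6]
```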

I already found out that it's nearly impossible to get radius queries to run faster by conventional means like recursion unrolling.

I haven't tried Birch since I don't know how it will behave. What I'm trying to do is actually segmentation by identifying the k biggest clusters. DBSCAN on simple (x,y,z) vectors works like magic, much better than any of the similarly simple options in PCL.
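
As an aside, a hedged sketch of the segmentation workflow being described here (eps, min_samples, and k are illustrative, and points_xyz is a placeholder for the real (x, y, z) data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

points_xyz = np.random.rand(5000, 3)      # placeholder for the real point cloud

labels = DBSCAN(eps=0.05, min_samples=10).fit_predict(points_xyz)

# Keep the k biggest clusters, ignoring noise (label -1)
k = 3
ids, counts = np.unique(labels[labels != -1], return_counts=True)
biggest = ids[np.argsort(counts)[::-1][:k]]
```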

@amueller
Member

Well, if you have something like point clouds from a 2D depth sensor like the Kinect, you already have the data in a grid structure and don't need to build a tree.

@GaelVaroquaux
Member

Well, if you have something like point clouds from a 2D depth sensor like the Kinect, you already have the data in a grid structure and don't need to build a tree.

You should be using our agglomerative clustering in these situations. It is very fast when imposing a connectivity constraint.
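
A minimal sketch of that suggestion, assuming a k-nearest-neighbors connectivity graph (the neighborhood size and cluster count are illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(1000, 3)               # stand-in for an (x, y, z) point cloud

# Restricting merges to a local spatial neighborhood is what keeps
# agglomerative clustering fast on data like this
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

labels = AgglomerativeClustering(n_clusters=5,
                                 connectivity=connectivity).fit_predict(X)
```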

@amueller
Member

Pretty sure it is not faster than algorithms that encode the grid structure directly ;) (which I contributed to scikit-image a while back)

@jnothman
Member

Thanks, @larsmans, I wasn't aware of that option.

Some of this new implementation may be reverted given the discussions in #4066 (ping @kno10), which seem to suggest that the lower memory cost of the 0.15 version may be worthwhile and more in keeping with the original algorithm. Otherwise, we could suggest that users for whom memory isn't a problem precompute the pairwise distance matrix (or perhaps just a sparse radius neighbors matrix) for speed. Can you please benchmark this version on your dataset against 0.15 with metric='precomputed'?
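
A minimal sketch of the two precomputation options being floated (eps is illustrative; whether DBSCAN accepts a sparse radius-neighbors matrix as precomputed input depends on the scikit-learn version):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import radius_neighbors_graph

X = np.random.rand(1000, 3)   # toy stand-in; a dense matrix for 380k points would need over a terabyte
eps = 0.1

# Option 1: full pairwise distance matrix, O(n^2) memory
D = pairwise_distances(X)
labels = DBSCAN(eps=eps, metric='precomputed').fit_predict(D)

# Option 2: sparse radius neighbors graph, much cheaper to store
G = radius_neighbors_graph(X, radius=eps, mode='distance')
```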

@larsmans
Member Author

Precompute (3.8e5)^2 distances? Are you kidding?

@jnothman
Member

I didn't register the numbers, it seems. And I guess we'd need to write it differently again if we accepted a precomputed radius_neighbors_graph as input, but that could be an option. Okay, thanks.

@kno10
Contributor
kno10 commented Jan 24, 2015

I don't know if you had a look at the changes I did in e48ade5 (in my patch-2 branch), which are a bit faster for me.
From my observations: if you can afford the memory, using a distance matrix is fastest.
On low-dimensional data with Euclidean distance, the kd-tree can work reasonably well. If you have enough memory to first compute the neighborhoods (as in HEAD) this can be quite a bit faster.
If you are low on memory and use a distance that won't work with the kd-tree, the old code may be faster...

@larsmans
Member Author

The old code is actually several orders of magnitude slower. Its only benefit is that it fits in memory. I tried cythonizing it, but that doesn't buy much. Maybe it can be improved with an LRU cache on the neighbors computations, but that will be tricky...

@jnothman
Member

The old code is actually several orders of magnitude slower

Including with precomputed neighbors, I presume?

Maybe it can be improved with an LRU cache on the neighbors computations,

I think it only visits each point once to calculate neighbors, so I'm not sure what we're caching.


@kno10
Contributor
kno10 commented Jan 24, 2015

DBSCAN would usually compute each distance twice; there is some potential in caching, but not that much. (You also only need to cache whether the distance was less than epsilon. But managing all this may quickly turn out to be more expensive than computing distances; at least for cheap distance functions like Euclidean)
I also found the old code to be a lot slower; but there might be room for optimization that doesn't precompute all neighborhoods at once. Maybe the code of the k-d-tree can also be optimized more, for single queries. It appears to do a lot of sanity checks every time?

@larsmans
Member Author

I think it only visits each point once to calculate neighbors, so I'm not sure what we're caching.

You're right.

I just managed to get a simple Cython/C++ version to run faster than the current code. It's almost exactly the pseudocode on Wikipedia, but with a few tiny optimizations.
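
Not the merged Cython code, but a rough Python transliteration of the Wikipedia pseudocode being referenced, assuming the eps-neighborhoods have already been computed in batch:

```python
import numpy as np

def dbscan_sketch(neighborhoods, min_samples):
    """Assign cluster labels given precomputed eps-neighborhoods.

    ``neighborhoods[i]`` is the array of indices within eps of point i
    (including i itself). Returns -1 for noise, otherwise a cluster id.
    """
    n = len(neighborhoods)
    labels = np.full(n, -1, dtype=np.intp)
    core = np.array([len(nb) >= min_samples for nb in neighborhoods])

    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Depth-first expansion from an unvisited core point
        stack = [i]
        while stack:
            j = stack.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster_id
            if core[j]:               # only core points propagate the cluster
                stack.extend(k for k in neighborhoods[j] if labels[k] == -1)
        cluster_id += 1
    return labels
```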

Maybe the code of the k-d-tree can also be optimized more, for single queries.

Just what I was thinking. If we can put a Cython interface on that, and include a fast path for single queries, that could be a big win.

Also, I noticed it has an unused fast path for only counting the number of neighbors within the epsilon radius. That can be run in batch to determine the core points, if we can hack the sample_weight support in there.
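
A hedged sketch of that batch core-point determination using the public neighbors API (eps and min_samples are illustrative; sample_weight handling and the tree's internal count-only path are omitted):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(1000, 3)
eps, min_samples = 0.1, 5

nn = NearestNeighbors(radius=eps).fit(X)
neighborhoods = nn.radius_neighbors(X, return_distance=False)

# A point is a core point if its eps-neighborhood (including itself)
# contains at least min_samples points
n_neighbors = np.array([len(nb) for nb in neighborhoods])
core_mask = n_neighbors >= min_samples
```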

@jnothman
Member

Keeping in mind, of course, that when LSHForest is up to scratch, we'd still want to be able to substitute approximate for exact radius neighbors...


@larsmans
Member Author

@jnothman In that case we also want #4157, because that changes the second phase of the algorithm to linear time. Using NumPy's setops involves sorting.

@jnothman
Member

Yes, okay. I hadn't looked at the Cython implementation; I'd presumed that "It's almost exactly the pseudocode on Wikipedia" meant you no longer calculated radius_neighbors in batch.

@larsmans
Member Author

Ah, no. But it is a vanilla depth-first search now, and we could plug in the neighbors query where it's currently fetching neighbors from an array.
