[MRG] optimize DBSCAN (~10% faster on my data) #4151
Conversation
Thanks!
Still not fast enough, though. The 380k dataset is one of the smaller ones...
How about Birch? Shouldn't that be faster?
But that's what DBSCAN is supposed to do, isn't it? It does radius neighbor queries and expands the clusters from those. The radius queries are also one of the two bottlenecks, the other being the sorting in `np.unique`. I already found out that it's nearly impossible to make radius queries run faster by conventional means like recursion unrolling. I haven't tried Birch since I don't know how it will behave. What I'm trying to do is actually segmentation by identifying the k biggest clusters. DBSCAN on simple (x, y, z) vectors works like magic, much better than any of the similarly simple options in PCL.
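A minimal sketch of the query-and-expand loop being described, using scikit-learn's `NearestNeighbors` (the point cloud, radius, and seed here are placeholders, not the 380k dataset):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the (x, y, z) point cloud described above.
rng = np.random.RandomState(0)
X = rng.rand(500, 3)

nn = NearestNeighbors(radius=0.15).fit(X)

# Expand one cluster from a seed point via repeated radius queries;
# these queries are the bottleneck mentioned above.
seed = 0
visited = np.zeros(len(X), dtype=bool)
stack = [seed]
cluster = []
while stack:
    i = stack.pop()
    if visited[i]:
        continue
    visited[i] = True
    cluster.append(i)
    # One radius query per visited point.
    neighbors = nn.radius_neighbors(X[i:i + 1], return_distance=False)[0]
    stack.extend(n for n in neighbors if not visited[n])
```

This omits the core/border distinction of real DBSCAN; it only illustrates why the radius queries dominate the running time.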
Well, if you have something like point clouds from a 2D depth sensor like the Kinect, you already have the data in a grid structure and don't need to build the tree.
You should be using our agglomerative clustering in these situations. It
Pretty sure it is not faster than the algorithms that actually encode the grid structure in the algorithm ;) (which are my contributions to scikit-image a while back)
Thanks, @larsmans, I wasn't aware of that option. Some of this new implementation may be reverted given the discussions in #4066 (ping @kno10), which suggest that the lower memory cost of the 0.15 version may be worthwhile, and more in keeping with the original algorithm; users for whom memory isn't a problem could instead precompute the pairwise distance matrix (or perhaps just a sparse radius neighbors matrix) for speed. Can you please benchmark this version against 0.15 with metric='precomputed' on your dataset?
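For reference, the sparse radius-neighbors route mentioned here can be sketched as follows. This assumes a scikit-learn version whose DBSCAN accepts a sparse precomputed matrix, which is part of what's under discussion; the data and parameters are placeholders:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import radius_neighbors_graph

rng = np.random.RandomState(0)
X = rng.rand(200, 3)  # toy stand-in for the 380k-sample point cloud
eps = 0.2

# Sparse matrix of pairwise distances, but only within radius eps:
# O(n * avg_neighbors) memory instead of the dense n**2 matrix.
D = radius_neighbors_graph(X, radius=eps, mode='distance')

db = DBSCAN(eps=eps, min_samples=5, metric='precomputed').fit(D)
labels = db.labels_
```

The expensive neighbor search then happens once, up front, and the clustering step itself only walks the sparse graph.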
Precompute (3.8e5)^2 distances? Are you kidding?
I didn't register the numbers, it seems. And I guess we'd need to write it differently again if we accepted a precomputed |
I don't know if you've had a look at the changes I made in e48ade5
The old code is actually several orders of magnitude slower; its only benefit is that it fits in memory. I tried Cythonizing it, but that doesn't buy much. Maybe it can be improved with an LRU cache on the neighbors computations, but that will be tricky...
Including with precomputed neighbors, I presume?
I think it only visits each point once to calculate neighbors, so I'm not sure.
DBSCAN would usually compute each distance twice; there is some potential in caching, but not that much. (You also only need to cache whether the distance was less than epsilon. But managing all this may quickly turn out to be more expensive than computing distances, at least for cheap distance functions like Euclidean.)
You're right. I just managed to get a simple Cython/C++ version to run faster than the current code. It's almost exactly the pseudocode on Wikipedia, but with a few tiny optimizations.
Just what I was thinking. If we can put a Cython interface on that, and include a fast path for single queries, that could be a big win. Also, I noticed it has an unused fast path for only counting the number of neighbors within the epsilon radius. That can be run in batch to determine the core points, if we can hack the
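The batch core-point idea can already be approximated at the Python level. The count-only fast path mentioned lives in the tree code and isn't exposed, so the lengths of the returned neighborhoods stand in for it in this sketch (data and parameters are placeholders):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(1)
X = rng.rand(300, 3)
eps, min_samples = 0.2, 5

# One batch radius query for all samples at once.
nn = NearestNeighbors(radius=eps).fit(X)
neighborhoods = nn.radius_neighbors(X, return_distance=False)

# Core points: the eps-neighborhood (including the point itself) has
# at least min_samples members. A count-only tree query would compute
# these sizes without materialising the index arrays at all.
n_within_eps = np.array([len(nb) for nb in neighborhoods])
is_core = n_within_eps >= min_samples
```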
Keeping in mind, of course, that when LSHForest is up to scratch, we'd
Yes, okay. I hadn't looked at the Cython implementation; I'd presumed that "It's almost exactly the pseudocode on Wikipedia" meant you no longer calculated |
Ah, no. But it is a vanilla depth-first search now, and we could plug in the neighbors query where it's currently fetching neighbors from an array. |
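That depth-first expansion over a precomputed neighborhoods array can be sketched in pure Python, in the spirit of the Wikipedia pseudocode (the actual version under discussion is Cython/C++, so treat this as an illustration only):

```python
import numpy as np

def dbscan_inner(is_core, neighborhoods):
    """Depth-first cluster expansion over precomputed neighborhoods.

    is_core: boolean sequence, True where a sample is a core point.
    neighborhoods: per-sample arrays/lists of neighbor indices within eps.
    Returns a label array; -1 marks noise.
    """
    labels = np.full(len(is_core), -1, dtype=np.intp)
    cluster = 0
    for i in range(len(is_core)):
        if labels[i] != -1 or not is_core[i]:
            continue  # only unlabelled core points seed a new cluster
        stack = [i]
        while stack:
            j = stack.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster
            if is_core[j]:
                # Only core points expand the cluster further;
                # border points are labelled but not expanded.
                stack.extend(k for k in neighborhoods[j] if labels[k] == -1)
        cluster += 1
    return labels
```

The neighbors query could be plugged in exactly where `neighborhoods[j]` is fetched from the array, as suggested above.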
I've found that, on a set of 380k samples (3 features; point clouds), `radius_neighbors` takes 33% of the time, `np.unique` 32% and `intersect1d` 15%. This speeds up `intersect1d` by a factor of two and takes away some minor overhead. Also fixes the path of an example script.
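For context, one way to roughly halve the cost of `intersect1d` is `assume_unique`; neighbor index arrays contain no duplicates, so this is my guess at the mechanism, not necessarily the exact change in this PR:

```python
import numpy as np

a = np.array([3, 1, 7, 5])
b = np.array([7, 2, 5, 9])

# np.intersect1d runs np.unique on both inputs by default; when the
# inputs are already duplicate-free (as neighbor index arrays are),
# assume_unique=True skips that work while giving the same result.
fast = np.intersect1d(a, b, assume_unique=True)
slow = np.intersect1d(a, b)
```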