[MRG] Faster vectorization of DBSCAN (plain python) by kno10 · Pull Request #4334 · scikit-learn/scikit-learn · GitHub
[MRG] Faster vectorization of DBSCAN (plain python) #4334

Closed
wants to merge 1 commit into from

Conversation

kno10
Contributor
@kno10 kno10 commented Mar 4, 2015

This pull request includes: fixes #4066 fixes #4073
This is still a plain python variant of DBSCAN, with the same drawbacks (worst case O(n^2) memory due to materialization of neighborhoods) as the current head version. But I vectorized operations differently, which is strictly faster on all my experiments. As such, it may also be useful as baseline for benchmarking the pending Cython rewrite (#4157) and is less invasive.
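To make the approach concrete, here is a minimal sketch of such a neighborhood-materializing, vectorized DBSCAN — illustrative only, not the patch itself, with made-up function and variable names — which shares the same worst-case O(n^2) memory behavior:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def dbscan_sketch(X, eps=0.5, min_samples=5):
    """Toy DBSCAN that materializes all neighborhoods up front
    (hence the worst-case O(n^2) memory) and expands clusters with
    vectorized NumPy set operations."""
    n = X.shape[0]
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X, return_distance=False)
    is_core = np.array([len(nb) >= min_samples for nb in neighborhoods])
    labels = np.full(n, -1)  # -1 marks noise / unvisited
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        candidates = np.array([i])
        while len(candidates) > 0:
            labels[candidates] = cluster
            # only core points spread the cluster further
            cores = candidates[is_core[candidates]]
            if len(cores) == 0:
                break
            neighbors = np.unique(
                np.concatenate([neighborhoods[c] for c in cores]))
            candidates = neighbors[labels[neighbors] == -1]
        cluster += 1
    return labels


# Two tight blobs and one isolated noise point:
rng = np.random.RandomState(0)
X = np.vstack([rng.rand(20, 2) * 0.2,
               rng.rand(20, 2) * 0.2 + 5.0,
               [[10.0, 10.0]]])
labels = dbscan_sketch(X, eps=0.3, min_samples=4)
```

The inner loop operates on whole candidate arrays at once instead of point by point, which is where the speedup over a scalar Python loop comes from.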

@ogrisel
Member
ogrisel commented Mar 5, 2015

Thanks @kno10. Could you please run some benchmarks to compare @larsmans' Cython rewrite of #4157 and this? You can use data such as generated in this notebook for instance, maybe with more dimensions (e.g. 100) and more samples (e.g. up to 1e6).

@ogrisel ogrisel changed the title MRG: Faster vectorization of DBSCAN (plain python) [MRG] Faster vectorization of DBSCAN (plain python) Mar 5, 2015
@kno10
Contributor Author
kno10 commented Mar 5, 2015

Instead of artificial data, I used 50k real GPS coordinates from Twitter, with parameters so that about 1/3rd of the points are in 6 clusters. I think the DBSCAN parameters were chosen realistically, without overfitting.
master: 1.96753382683 s prior to assume_unique=True addition
patch-1: 1.96305990219 s #4066
patch-2: 1.19516587257 s #4334
cython: 1.19392514229 s #4157
0.15.2: 10.0197000504 s

470k instances, different parameters to account for the larger size.
master: 26.3678278923 s prior to assume_unique=True addition
patch-1: 24.5347480774 s #4066
patch-2: 18.3952949047 s #4334
cython: 14.6483030319 s #4157

So this time the Cython version was faster, though the difference is not as pronounced as expected. Note that its result is currently in line with master, not with the adjustments in #4066 that align it with the literature. This can supposedly be changed easily (essentially an "off by one" in a parameter).

@ogrisel
Member
ogrisel commented Mar 5, 2015

Thank you very much @kno10, could you please also profile the memory usage with memory_profiler?

@ogrisel
Member
ogrisel commented Mar 5, 2015

@kno10 now that I merged the boundary / no-shuffle fix to master you might need to rebase the actual fix and strip the other commits from that PR and git push -f to get github to actually display the real meat of this PR.

@kno10
Contributor Author
kno10 commented Mar 5, 2015

Rebased from patch-1 onto master, and pushed again. I tried memory profiling, but all versions look the same to me memory-wise. If there is a difference, it is reported in lines that are identical across these versions (the memory-hungry operation is the computation of all neighborhoods, which is the same).
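For anyone wanting to reproduce such a check without memory_profiler, a rough stdlib-only approximation using tracemalloc could look like this (the dataset and parameters here are made up for illustration):

```python
# Rough peak-memory check using the stdlib tracemalloc module
# (memory_profiler gives nicer per-line output but must be installed;
# dataset size and eps below are illustrative only).
import tracemalloc

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(5000, 2)

tracemalloc.start()
labels = DBSCAN(eps=0.02, min_samples=5).fit(X).labels_
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print("peak traced memory: %.1f MiB" % (peak / 2 ** 20))
```

Running this on both branches with the same data gives a quick, coarse comparison of the neighborhood-materialization cost.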

@@ -137,6 +138,9 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
# label_num is the label given to the new cluster
label_num = 0

# Candidates
ctmp = np.zeros(X.shape[0], dtype=np.int8)
Member

Please give this a better name rather than a comment

Contributor Author

"candidates_temporary"? It's a temporary collection; there is a proper "candidates" there, too.

Member

tmp_candidates or candidates_tmp then.

Member

Re-reading the code I think that candidates_mask would be a more descriptive name.

Member

The declaration could read:

# Temporary boolean mask to mark core-samples as candidates for neighborhood
# expansion in the inner labeling loop.
# 1 means that it should be considered as core-sample candidate for the next round of
# expansions. 0 is used as a marker to skip a core-sample that has already been labeled
# in a past iteration. 
candidates_mask = np.zeros(X.shape[0], dtype=np.int8)
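For illustration, a toy sketch of how such mask-based bookkeeping behaves (all indices here are made up):

```python
import numpy as np

n = 10
# candidates_mask[i] == 1 marks point i as a core-sample candidate for
# the next expansion round; already-labeled points are reset to 0.
candidates_mask = np.zeros(n, dtype=np.int8)

neighbors = np.array([2, 5, 7])      # hypothetical neighborhood
already_labeled = np.array([5])      # hypothetical labeled point

candidates_mask[neighbors] = 1
candidates_mask[already_labeled] = 0
candidates = np.where(candidates_mask)[0]
print(candidates)  # -> [2 7]
```

Reusing one preallocated mask across rounds avoids repeatedly building fresh candidate arrays in the inner loop.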

@ogrisel
Member
ogrisel commented Mar 6, 2015

Rebased from patch-1 onto master, and pushed again. I tried memory profiling, but all versions look the same to me memory-wise. If there is a difference, it is reported in lines that are identical across these versions (the memory-hungry operation is the computation of all neighborhoods, which is the same).

Thanks for checking. Based on your benchmarks it seems that @larsmans' Cython code has a slight speed advantage but I have not reviewed the actual implementation to check whether it's more in line with the "official" DBSCAN from the literature or not.

BTW, is your benchmark data part of a public dataset? It would be interesting to have such high-n_samples, low-n_features datasets for running realistic clustering experiments. I think clustering geo data is a pretty common use case for identifying POIs and cleaning up / summarizing mobile data.

@kno10
Contributor Author
kno10 commented Mar 6, 2015

@larsmans' code supposedly uses a spanning tree, and may thus return slightly different results on border points. My version should yield exactly the same result as the published DBSCAN.
Sorry, per Twitter ToS I cannot make the data available. I could only make Tweet IDs available, which you could then bulk-download again; but you can then just use the sampler or search APIs yourself and build a similar, newer data set.
I would not rely on Twitter data for anything serious. There is too much spam, and too many fake coordinates. Density variations are too large for DBSCAN to be useful, and k-means is worse than random. POI generation and these are often given as use cases, but I don't think they actually work well; too much bias that you cannot control or account for.

assume_unique=True)
if len(candidates) == 0: break
Member

Please put break on its own line to follow PEP8 conventions.

@ogrisel
Member
ogrisel commented Mar 6, 2015

@larsmans' code supposedly uses a spanning tree, and may thus return slightly different results on border points. My version should yield exactly the same result as the published DBSCAN.

+1 for trying to stick to the official algorithm as the speed difference is not that important (at least on that data).

What do other people think?

@GaelVaroquaux
Member
GaelVaroquaux commented Mar 6, 2015 via email

@ogrisel
Member
ogrisel commented Mar 6, 2015

I already asked, and @kno10 answered that the memory bottleneck is not in the section impacted by this PR, nor in the alternative Cython refactoring by @larsmans. The memory allocations are done when pre-computing the neighborhoods in the first part of the dbscan function.

@kno10
Contributor Author
kno10 commented Mar 7, 2015

Pushed a version with the suggested changes.

This version benchmarks 20%-40% faster for me than current head,
and ~90% faster than 0.15.2.
It also was faster than using a precomputed distance matrix,
even when the data set was small enough to do so
(I used 10k and 50k coordinates from Twitter).

@larsmans
Member
larsmans commented Mar 9, 2015

I'm not using spanning trees. My code uses a simple depth-first search, stopping at non-core points. That's what the original does, too.
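The depth-first strategy described here — spreading labels from core points and stopping at non-core (border) points — can be sketched in a few lines. This is a toy illustration, not the actual Cython code; `neighborhoods` is a precomputed list of neighbor indices and `is_core` a boolean array:

```python
import numpy as np


def dbscan_labels_dfs(neighborhoods, is_core):
    """Label clusters by depth-first search from each unvisited core
    point, stopping expansion at non-core (border) points."""
    n = len(neighborhoods)
    labels = np.full(n, -1)  # -1 marks noise / unvisited
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        stack = [i]
        while stack:
            p = stack.pop()
            if labels[p] != -1:
                continue
            labels[p] = cluster
            if is_core[p]:  # only core points spread the label
                stack.extend(neighborhoods[p])
        cluster += 1
    return labels


# Points 0-2 are core; point 3 is a border point reachable from 2;
# point 4 is isolated noise.
neighborhoods = [[0, 1, 2], [0, 1, 2], [0, 1, 2, 3], [2, 3], [4]]
is_core = np.array([True, True, True, False, False])
labels = dbscan_labels_dfs(neighborhoods, is_core)
# points 0-3 end up in cluster 0, point 4 stays noise
```

Since border points never push their own neighbors, the search cannot leak a label through a non-core point, matching the behavior of the original algorithm.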

@kno10
Contributor Author
kno10 commented Mar 11, 2015

Sorry, I thought you had done that, because there are follow-ups to DBSCAN that compute a density-based spanning tree.

@ogrisel
Member
ogrisel commented Mar 13, 2015

It seems that this alternative Python implementation is not always faster than the current implementation in master:

  • on this branch:
Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
Duration: 68.225s
Found 347 clusters
NMI: 0.8920
ARI: 0.6815
  • on master:
Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
Duration: 46.335s
Found 347 clusters
NMI: 0.8920
ARI: 0.6815

(note that 2 consecutive runs can vary by ~5s on the same branch).

@ogrisel
Member
ogrisel commented Mar 13, 2015

The code of the previous benchmark is available in #4157 (comment) if you want to reproduce.

@ogrisel
Member
ogrisel commented Mar 13, 2015

@kno10 I feel that the Cython implementation is more promising. It is always significantly faster than master in my experiments (contrary to this PR), and the Cython code is still quite straightforward to understand.

I would be in favor of closing this experimental PR in favor of #4157. What do you think?

@ogrisel
Member
ogrisel commented Mar 17, 2015

Closing this as #4157 has been merged in master. Thanks @kno10 for your contributions to improve the DBSCAN speed.

@ogrisel ogrisel closed this Mar 17, 2015
@kno10 kno10 deleted the patch-2 branch March 19, 2015 14:35