[MRG] Faster vectorization of DBSCAN (plain python) by kno10 · Pull Request #4334 · scikit-learn/scikit-learn · GitHub
[MRG] Faster vectorization of DBSCAN (plain python) #4334

Closed
wants to merge 1 commit into from

Conversation

kno10
Contributor
@kno10 kno10 commented Mar 4, 2015

This pull request includes: fixes #4066 fixes #4073
This is still a plain python variant of DBSCAN, with the same drawbacks (worst case O(n^2) memory due to materialization of neighborhoods) as the current head version. But I vectorized operations differently, which is strictly faster on all my experiments. As such, it may also be useful as baseline for benchmarking the pending Cython rewrite (#4157) and is less invasive.
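To make the approach concrete, here is a minimal sketch of such a neighborhood-materializing, vectorized DBSCAN — illustrative only, not the patch itself, with made-up function and variable names — which shares the same worst-case O(n^2) memory behavior:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def dbscan_sketch(X, eps=0.5, min_samples=5):
    """Toy DBSCAN that materializes all neighborhoods up front
    (hence the worst-case O(n^2) memory) and expands clusters with
    vectorized NumPy set operations."""
    n = X.shape[0]
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X, return_distance=False)
    is_core = np.array([len(nb) >= min_samples for nb in neighborhoods])
    labels = np.full(n, -1)  # -1 marks noise / unvisited
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        candidates = np.array([i])
        while len(candidates) > 0:
            labels[candidates] = cluster
            # only core points spread the cluster further
            cores = candidates[is_core[candidates]]
            if len(cores) == 0:
                break
            neighbors = np.unique(
                np.concatenate([neighborhoods[c] for c in cores]))
            candidates = neighbors[labels[neighbors] == -1]
        cluster += 1
    return labels


# Two tight blobs and one isolated noise point:
rng = np.random.RandomState(0)
X = np.vstack([rng.rand(20, 2) * 0.2,
               rng.rand(20, 2) * 0.2 + 5.0,
               [[10.0, 10.0]]])
labels = dbscan_sketch(X, eps=0.3, min_samples=4)
```

The inner loop operates on whole candidate arrays at once instead of point by point, which is where the speedup over a scalar Python loop comes from.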

@ogrisel
Member
ogrisel commented Mar 5, 2015

Thanks @kno10. Could you please run some benchmarks to compare @larsmans' Cython rewrite of #4157 and this? You can use data such as generated in this notebook for instance, maybe with more dimensions (e.g. 100) and more samples (e.g. up to 1e6).

@ogrisel ogrisel changed the title MRG: Faster vectorization of DBSCAN (plain python) [MRG] Faster vectorization of DBSCAN (plain python) Mar 5, 2015
@kno10
Contributor Author
kno10 commented Mar 5, 2015

Instead of artificial data, I used 50k real GPS coordinates from Twitter, with parameters so that about 1/3rd of the points are in 6 clusters. I think the DBSCAN parameters were chosen realistically, without overfitting.
master: 1.96753382683 s prior to assume_unique=True addition
patch-1: 1.96305990219 s #4066
patch-2: 1.19516587257 s #4334
cython: 1.19392514229 s #4157
0.15.2: 10.0197000504 s

470k instances, different parameters to account for the larger size.
master: 26.3678278923 s prior to assume_unique=True addition
patch-1: 24.5347480774 s #4066
patch-2: 18.3952949047 s #4334
cython: 14.6483030319 s #4157

So this time the Cython version was faster, though the difference is not as pronounced as expected. Note that its result is currently in line with master, not with the adjustments in #4066 that align it with the literature. This can supposedly be changed easily (essentially an "off by one" in a parameter).

@ogrisel
Member
ogrisel commented Mar 5, 2015

Thank you very much @kno10, could you please also profile the memory usage with memory_profiler?

@ogrisel
Member
ogrisel commented Mar 5, 2015

@kno10 now that I merged the boundary / no-shuffle fix to master you might need to rebase the actual fix and strip the other commits from that PR and git push -f to get github to actually display the real meat of this PR.

@kno10
Contributor Author
kno10 commented Mar 5, 2015

Rebased from patch-1 onto master, and pushed again. I tried memory profiling, but all versions look the same to me memory-wise. If there is a difference, it is reported in lines that are identical across these versions (the memory-hungry operation is the computation of all neighborhoods, which is the same).
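For anyone wanting to reproduce such a check without memory_profiler, a rough stdlib-only approximation using tracemalloc could look like this (the dataset and parameters here are made up for illustration):

```python
# Rough peak-memory check using the stdlib tracemalloc module
# (memory_profiler gives nicer per-line output but must be installed;
# dataset size and eps below are illustrative only).
import tracemalloc

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(5000, 2)

tracemalloc.start()
labels = DBSCAN(eps=0.02, min_samples=5).fit(X).labels_
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print("peak traced memory: %.1f MiB" % (peak / 2 ** 20))
```

Running this on both branches with the same data gives a quick, coarse comparison of the neighborhood-materialization cost.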

@@ -137,6 +138,9 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
# label_num is the label given to the new cluster
label_num = 0

# Candidates
ctmp = np.zeros(X.shape[0], dtype=np.int8)
Member

Please give this a better name rather than a comment

Contributor Author

"candidates_temporary"? It's a temporary collection; there is a proper "candidates" there, too.

Member

tmp_candidates or candidates_tmp then.

Member

Re-reading the code I think that candidates_mask would be a more descriptive name.

Member

The declaration could read:

# Temporary boolean mask to mark core-samples as candidates for neighborhood
# expansion in the inner labeling loop.
# 1 means that it should be considered as core-sample candidate for the next round of
# expansions. 0 is used as a marker to skip a core-sample that has already been labeled
# in a past iteration. 
candidates_mask = np.zeros(X.shape[0], dtype=np.int8)
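For illustration, a toy sketch of how such mask-based bookkeeping behaves (all indices here are made up):

```python
import numpy as np

n = 10
# candidates_mask[i] == 1 marks point i as a core-sample candidate for
# the next expansion round; already-labeled points are reset to 0.
candidates_mask = np.zeros(n, dtype=np.int8)

neighbors = np.array([2, 5, 7])      # hypothetical neighborhood
already_labeled = np.array([5])      # hypothetical labeled point

candidates_mask[neighbors] = 1
candidates_mask[already_labeled] = 0
candidates = np.where(candidates_mask)[0]
print(candidates)  # -> [2 7]
```

Reusing one preallocated mask across rounds avoids repeatedly building fresh candidate arrays in the inner loop.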

@ogrisel
Member
ogrisel commented Mar 6, 2015

Rebased from patch-1 onto master, and pushed again. I tried memory profiling, but all versions look the same to me memory-wise. If there is a difference, it is reported in lines that are identical across these versions (the memory-hungry operation is the computation of all neighborhoods, which is the same).

Thanks for checking. Based on your benchmarks it seems that @larsmans' Cython code has a slight speed advantage but I have not reviewed the actual implementation to check whether it's more in line with the "official" DBSCAN from the literature or not.

BTW, is your benchmark data part of a public dataset? It would be interesting to have such high-n_samples, low-n_features datasets for running realistic clustering experiments. I think clustering geo data is a pretty common use case for identifying POIs and cleaning up / summarizing mobile data.

@kno10
Contributor Author
kno10 commented Mar 6, 2015

@larsmans' code supposedly uses a spanning tree, and may thus return slightly different results on border points. My version should yield exactly the same result as the published DBSCAN.
Sorry, per Twitter ToS I cannot make the data available. I could only make Tweet IDs available, which you could then bulk-download again; but you can then just use the sampler or search APIs yourself and build a similar, newer data set.
I would not rely on Twitter data for anything serious. There is too much spam, and too many fake coordinates. Density variations are too large for DBSCAN to be useful, and k-means is worse than random. POI generation and these are often given as use cases, but I don't think they actually work well; too much bias that you cannot control or account for.

assume_unique=True)
if len(candidates) == 0: break
Member

Please put break on its own line to follow PEP8 conventions.

@ogrisel
Member
ogrisel commented Mar 6, 2015

@larsmans' code supposedly uses a spanning tree, and may thus return slightly different results on border points. My version should yield exactly the same result as the published DBSCAN.

+1 for trying to stick to the official algorithm as the speed difference is not that important (at least on that data).

What do other people think?

@GaelVaroquaux
Member
GaelVaroquaux commented Mar 6, 2015 via email

@ogrisel
Member
ogrisel commented Mar 6, 2015

I already asked, and @kno10 answered that the memory bottleneck is not in the section impacted by this PR, nor in the alternative Cython refactoring by @larsmans. The memory allocations are done when pre-computing the neighborhoods in the first part of the dbscan function.

@kno10
Contributor Author
kno10 commented Mar 7, 2015

Pushed a version with the suggested changes.

This version benchmarks 20%-40% faster for me than current head,
and ~90% faster than 0.15.2.
It also was faster than using a precomputed distance matrix,
even when the data set was small enough to do so
(I used 10k and 50k coordinates from Twitter).

@larsmans
Member
larsmans commented Mar 9, 2015

I'm not using spanning trees. My code uses a simple depth-first search, stopping at non-core points. That's what the original does, too.
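The depth-first strategy described here — spreading labels from core points and stopping at non-core (border) points — can be sketched in a few lines. This is a toy illustration, not the actual Cython code; `neighborhoods` is a precomputed list of neighbor indices and `is_core` a boolean array:

```python
import numpy as np


def dbscan_labels_dfs(neighborhoods, is_core):
    """Label clusters by depth-first search from each unvisited core
    point, stopping expansion at non-core (border) points."""
    n = len(neighborhoods)
    labels = np.full(n, -1)  # -1 marks noise / unvisited
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        stack = [i]
        while stack:
            p = stack.pop()
            if labels[p] != -1:
                continue
            labels[p] = cluster
            if is_core[p]:  # only core points spread the label
                stack.extend(neighborhoods[p])
        cluster += 1
    return labels


# Points 0-2 are core; point 3 is a border point reachable from 2;
# point 4 is isolated noise.
neighborhoods = [[0, 1, 2], [0, 1, 2], [0, 1, 2, 3], [2, 3], [4]]
is_core = np.array([True, True, True, False, False])
labels = dbscan_labels_dfs(neighborhoods, is_core)
# points 0-3 end up in cluster 0, point 4 stays noise
```

Since border points never push their own neighbors, the search cannot leak a label through a non-core point, matching the behavior of the original algorithm.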

@kno10
Contributor Author
kno10 commented Mar 11, 2015

Sorry, I thought you had done that, because there are follow-ups to DBSCAN that compute a density-based spanning tree.

@ogrisel
Member
ogrisel commented Mar 13, 2015

It seems that this alternative Python implementation is not always faster than the current implementation in master:

  • on this branch:
Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
Duration: 68.225s
Found 347 clusters
NMI: 0.8920
ARI: 0.6815
  • on master:
Fitting dbscan on (500000, 2)-shaped blob data (with uniform noise)
Duration: 46.335s
Found 347 clusters
NMI: 0.8920
ARI: 0.6815

(note that 2 consecutive runs can vary by ~5s on the same branch).

@ogrisel
Member
ogrisel commented Mar 13, 2015

The code of the previous benchmark is available in #4157 (comment) if you want to reproduce.

@ogrisel
Member
ogrisel commented Mar 13, 2015

@kno10 I feel that the Cython implementation is more promising. It is always significantly faster than master in my experiments (contrary to this PR), and the Cython code is still quite straightforward to understand.

I would be in favor of closing this experimental PR in favor of #4157. What do you think?

@ogrisel
Member
ogrisel commented Mar 17, 2015

Closing this as #4157 has been merged in master. Thanks @kno10 for your contributions to improve the DBSCAN speed.

@ogrisel ogrisel closed this Mar 17, 2015
@kno10 kno10 deleted the patch-2 branch March 19, 2015 14:35