[MRG] Faster vectorization of DBSCAN (plain python) #4334
Conversation
Thanks @kno10. Could you please run some benchmarks to compare @larsmans' Cython rewrite of #4157 and this? You can use data such as that generated in this notebook, for instance, maybe with more dimensions (e.g. 100) and more samples (e.g. up to 1e6).
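The referenced notebook is not reproduced here; as a rough illustration of the kind of synthetic benchmark being asked for, something along these lines would do (the sizes, eps and min_samples are illustrative assumptions, not values from the notebook):

import time
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data in the spirit of the request: many dimensions, many samples.
# All parameter values here are illustrative only.
X, _ = make_blobs(n_samples=100000, n_features=100, centers=10,
                  cluster_std=1.0, random_state=42)

start = time.time()
labels = DBSCAN(eps=15.0, min_samples=10).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("DBSCAN: %.1fs, %d clusters" % (time.time() - start, n_clusters))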
Instead of artificial data, I used 50k real GPS coordinates from Twitter, with parameters chosen so that about 1/3rd of the points end up in 6 clusters. I think the DBSCAN parameters were chosen realistically, without overfitting. A second run used 470k instances, with different parameters to account for the larger size. This time, the Cython version was faster, although the difference was not as pronounced as expected. The result is currently in line with master, not with the adjustments in #4066 that align it with the literature; this can supposedly be changed easily (essentially an "off by one" in a parameter).
Thank you very much @kno10, could you please also profile the memory usage with memory_profiler?
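For reference, one way such a comparison could be run with memory_profiler's memory_usage helper (the data and DBSCAN parameters below are placeholders, not the actual benchmark):

import numpy as np
from memory_profiler import memory_usage
from sklearn.cluster import DBSCAN

# Placeholder data standing in for the real benchmark set.
X = np.random.RandomState(0).rand(50000, 2)

def run_dbscan():
    DBSCAN(eps=0.01, min_samples=10).fit(X)

# Sample memory every 0.1s while the call runs and report the peak.
peak = max(memory_usage((run_dbscan, (), {}), interval=0.1))
print("peak memory: %.1f MiB" % peak)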
@kno10 now that I merged the boundary / no-shuffle fix to master, you might need to rebase the actual fix and strip the other commits from that PR and
Rebased from patch-1 to master, and pushed again. I tried memory profiling, but they all look the same to me memory-wise. If there is a difference, it's reported in lines that are identical across these versions (the memory-hungry operation is the computation of all neighbors, which is the same).
@@ -137,6 +138,9 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
    # label_num is the label given to the new cluster
    label_num = 0

    # Candidates
    ctmp = np.zeros(X.shape[0], dtype=np.int8)
Please give this a better name rather than a comment
"candidates_temporary"? It's a temporary collection; there is a proper "candidates" there, 8000 too.
tmp_candidates or candidates_tmp, then.
Re-reading the code, I think that candidates_mask would be a more descriptive name.
The declaration could read:
# Temporary boolean mask marking core samples as candidates for neighborhood
# expansion in the inner labeling loop.
# 1 means the sample should be considered as a core-sample candidate for the
# next round of expansions. 0 is used as a marker to skip a core sample that
# has already been labeled in a past iteration.
candidates_mask = np.zeros(X.shape[0], dtype=np.int8)
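To make the intent of the mask concrete, here is a sketch of how it could drive one round of expansion (not the PR's exact code; neighborhoods, labels, is_core, candidates and label_num are assumptions about the surrounding function, which already has numpy imported as np):

candidates_mask[:] = 0
for c in candidates:
    neighb = neighborhoods[c]                 # precomputed eps-neighborhood of c
    unlabeled = neighb[labels[neighb] == -1]  # neighbors not yet assigned
    labels[unlabeled] = label_num             # border and core points get the label
    candidates_mask[unlabeled] = 1            # remember them for the next round
# only newly labeled core samples are expanded in the next iteration
candidates = np.where(candidates_mask & is_core)[0]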
Thanks for checking. Based on your benchmarks it seems that @larsmans' Cython code has a slight speed advantage, but I have not reviewed the actual implementation to check whether it's more in line with the "official" DBSCAN from the literature or not. BTW, is your benchmark data part of a public dataset? It would be interesting to have such a high
@larsmans' code supposedly uses a spanning tree, and may thus return slightly different results on border points. My version should yield the exact same result as the published DBSCAN.
assume_unique=True)
if len(candidates) == 0: break
Please put break on its own line to follow PEP8 conventions.
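That is, the line above would become:

if len(candidates) == 0:
    break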
+1 for trying to stick to the official algorithm, as the speed difference is not that important (at least on that data). What do other people think?
Any memory usage difference?
Pushed a version with the suggested changes.
This version benchmarks 20%-40% faster for me than the current head, and ~90% faster than 0.15.2. It was also faster than using a precomputed distance matrix, even when the data set was small enough to do so (I used 10k and 50k coordinates from Twitter).
I'm not using spanning trees. My code uses a simple depth-first search, stopping at non-core points. That's what the original does, too.
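For readers comparing the two approaches, here is a schematic of the textbook depth-first expansion described above (a sketch of the classic procedure, not this PR's vectorized code; neighborhoods, labels and is_core are assumed to be precomputed):

def expand_cluster(seed, label_num, labels, neighborhoods, is_core):
    # Classic DBSCAN expansion: follow core points depth-first,
    # label border points but do not expand from them.
    stack = [seed]
    while stack:
        point = stack.pop()
        for neighbor in neighborhoods[point]:
            if labels[neighbor] != -1:
                continue                  # already assigned
            labels[neighbor] = label_num  # border and core points get labeled
            if is_core[neighbor]:
                stack.append(neighbor)    # only core points are expanded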
Sorry, I thought you had done that, because there are follow-ups to DBSCAN, like this one, that compute a density-based spanning tree.
It seems that this alternative Python implementation is not always faster than the current implementation in master:
(note that 2 consecutive runs can vary by ~5s on the same branch).
The code of the previous benchmark is available in #4157 (comment) if you want to reproduce.
@kno10 I feel that the Cython implementation is more promising. It is always significantly faster than master in my experiments (contrary to this PR), and the Cython code is still quite straightforward to understand. I would be in favor of closing this experimental PR in favor of #4157. What do you think?
This pull request includes: fixes #4066, fixes #4073.
This is still a plain Python variant of DBSCAN, with the same drawbacks (worst-case O(n^2) memory due to the materialization of neighborhoods) as the current head version. But I vectorized the operations differently, which is strictly faster in all my experiments. As such, it may also be useful as a baseline for benchmarking the pending Cython rewrite (#4157), and it is less invasive.
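The materialization referred to above is the up-front computation of every point's eps-neighborhood, roughly along these lines (a sketch of that step only, with X and eps standing for the arguments of dbscan):

from sklearn.neighbors import NearestNeighbors

# All eps-neighborhoods are built before labeling starts; in the worst case
# (eps large relative to the data) each of the n index arrays holds n
# entries, hence the O(n^2) memory bound.
neighbors_model = NearestNeighbors(radius=eps).fit(X)
neighborhoods = neighbors_model.radius_neighbors(X, radius=eps,
                                                 return_distance=False)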