CLN Cleaned cluster/_hdbscan/_reachability.pyx
#24701
Conversation
cluster/_hdbscan/_reachability.pyx
I'm not sure why, but the name of the branch does not allow me to check out your branch.
I opened Micky774#5 to address some naming issues and a couple of Cython improvements.
Force-pushed e1b52cc to b83f614 (…lity MAINT additional cleaning in reachibility.pyx)
From the other PR:
Yes, this is a critical assumption. I've updated the code to include some validation on the precomputed distance matrix to ensure symmetry. As an aside, in other parts of the codebase that assume symmetric matrices, I know we check for at least squareness, but I'm not sure if we check for actual symmetry. It's not expensive using …
It might be worth using:

try:
    assert_allclose_dense_sparse(X, X.T)
except AssertionError as exc:
    raise ValueError(...) from exc
(See sklearn/utils/validation.py, line 1774 at 3e47fa9.)
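For dense inputs, a minimal self-contained sketch of that kind of symmetry validation could look like the following; the helper name and tolerance are illustrative assumptions, not the PR's actual code:

import numpy as np

def _check_symmetric_distances(X, tol=1e-8):
    # Hypothetical helper: reject non-square or asymmetric precomputed
    # distance matrices before building the mutual reachability graph.
    if X.ndim != 2 or X.shape[0] != X.shape[1]:
        raise ValueError("Precomputed distance matrix must be square.")
    if not np.allclose(X, X.T, atol=tol):
        raise ValueError("Precomputed distance matrix must be symmetric.")

The assert_allclose_dense_sparse approach suggested above has the advantage of also covering sparse inputs.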
+1 on this one.
Overall, this looks good.
distance_matrix, further_neighbor_idx, axis=1
)[:, further_neighbor_idx]
Using [:, further_neighbor_idx] here means that core_distances is not contiguous anymore. Does this lead to a performance regression? I'm okay with partitioning as in the original code, to keep core_distances contiguous.
Also, I think that slicing the original 2D array from np.partition is another instance of #17299. The workaround is to use further_neighbor_idx to index core_distances when computing the mutual_reachibility_distance.
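As a standalone illustration of the contiguity point (not the PR's code; the array and index here are made up):

import numpy as np

D = np.random.default_rng(0).random((5, 5))
further_neighbor_idx = 2
# Integer-indexing one column of the partitioned matrix yields a strided view.
core_distances = np.partition(D, further_neighbor_idx, axis=1)[:, further_neighbor_idx]
print(core_distances.flags["C_CONTIGUOUS"])  # False
# Enforcing contiguity requires an explicit copy.
core_distances = np.ascontiguousarray(core_distances)
print(core_distances.flags["C_CONTIGUOUS"])  # True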
Turns out there is no significant performance or memory difference between the current non-contiguous setup and one where C-contiguity is enforced:
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.datasets import make_blobs
from sklearn.cluster._hdbscan._reachability import _dense_mutual_reachability_graph
X, _ = make_blobs(n_samples=20_000, random_state=10)
D = euclidean_distances(X)
%timeit -n 5 _dense_mutual_reachability_graph(D, 5)
non-contig: 2.66 s ± 73.3 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
contig: 2.74 s ± 197 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
non-contiguous / contiguous
Total number of allocations: 92508/93065
Total number of frames seen: 7109/7115
Peak memory usage: 6.6 GB/6.6 GB
Currently we do not require that distance_matrix is actually contiguous; however, that would need to be the case if we enforce that core_distances is contiguous. Given that there are no performance or memory gains, I'm not sure it's worth changing.
Thanks for the benchmark! The delta probably needs to be assessed based on what takes the longest to run in _dense_mutual_reachability_graph. This is inspectable using profilers like py-spy, for instance.
Given the scope of this PR, I am OK proceeding as-is and exploring potential performance improvements in subsequent PRs. What do you think?
Thanks for the tip to use py-spy, it's quite nice for this.
Looking at it again, it seems that swapping the prange to the outer loop and enforcing a contiguous 1D core_distances array is significantly faster. Before, the dense loop occupied ~30% of the runtime of _dense_mutual_reachability_graph, whereas with the changes the new proportion is ~1%. I tested whether indexing into the 2D core_distances or creating a 1D core_distances is preferable, and it seems the latter is notably faster, most likely due to caching.
The new forced-contiguity version is still faster even when dealing with non-contiguous inputs. Flame graphs below (not sure what the best way to share them would be).
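For context, a pure-numpy sketch of what the dense kernel computes; the actual .pyx code performs this in a parallel Cython loop (with prange now on the outer loop), and the function name and vectorized form here are mine, not the PR's:

import numpy as np

def dense_mutual_reachability(D, further_neighbor_idx=5):
    # Core distance: distance from each point to its k-th nearest neighbor,
    # via a partial sort of each row, copied into a contiguous 1D array.
    core_distances = np.ascontiguousarray(
        np.partition(D, further_neighbor_idx, axis=1)[:, further_neighbor_idx]
    )
    # Mutual reachability distance: max(core_i, core_j, d_ij).
    return np.maximum(
        np.maximum(core_distances[:, None], core_distances[None, :]), D
    )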
Thanks for the report.
I think posting .svg profiles of py-spy results, interpretable by Speedscope, with:
py-spy record --native -o py-spy-profile.svg -f speedscope -- python ./perf.py
might be the most convenient way for others to inspect them.
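For reference, a plausible ./perf.py driver for that command, reconstructed from the benchmark earlier in this thread (the file name comes from the command line above; the script itself is an assumption):

# perf.py — workload for py-spy, mirroring the earlier benchmark.
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.datasets import make_blobs
from sklearn.cluster._hdbscan._reachability import _dense_mutual_reachability_graph

X, _ = make_blobs(n_samples=20_000, random_state=10)
D = euclidean_distances(X)
_dense_mutual_reachability_graph(D, 5)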
Thanks @Micky774.
Here are a few comments.
This LGTM up to a few suggestions. Thank you, @Micky774!
Also the linting failure on the CI looks unrelated (the Azure worker seems to have failed).
Edit: since @thomasjpfan has reviewed and since there is one unresolved thread, I would wait for @thomasjpfan to approve this PR before merging it.
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@thomasjpfan: should we merge?
Overall looks good now. One last concern about using prange.
This reverts commit cc52631.
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Just one last comment.
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Reference Issues/PRs
Towards #24686
What does this implement/fix? Explain your changes.
Removes unnecessary imports and adds a clarifying TODO comment regarding the future implementation of the algorithm.
Any other comments?
Please feel free to suggest stylistic changes to sklearn/cluster/_hdbscan/_reachability.pyx in this PR; however, the main work to be done in that file is rewriting the LIL-based sparse mutual reachability algorithm to a CSR-based algorithm, which is currently slated as a follow-up PR after release.
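As a rough sketch of the direction that follow-up could take (the function name and in-place update are assumptions for illustration, not this PR's implementation), a CSR-based version would walk indptr/indices/data directly instead of LIL row lists:

import numpy as np
from scipy.sparse import csr_matrix

def csr_mutual_reachability(distance_matrix, core_distances):
    # Apply max(core_i, core_j, d_ij) to the stored entries of the graph.
    G = csr_matrix(distance_matrix)
    for i in range(G.shape[0]):
        start, stop = G.indptr[i], G.indptr[i + 1]
        cols = G.indices[start:stop]
        G.data[start:stop] = np.maximum(
            np.maximum(core_distances[i], core_distances[cols]),
            G.data[start:stop],
        )
    return G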