[go: up one dir, main page]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX Improve HDBSCAN Error Message when checking for connected components #27678

Merged
merged 9 commits into from
Dec 4, 2023
6 changes: 5 additions & 1 deletion doc/whats_new/v1.4.rst
Original file line number Diff line number Diff line change
Expand Up @@ -248,7 +248,11 @@ Changelog
`kdtree` and `balltree` values will be removed in 1.6.
:pr:`26744` by :user:`Shreesha Kumar Bhat <Shreesha3112>`.

- |FIX| : Create copy of precomputed sparse matrix within the
- |Fix| Improve error message when checking the number of connected components
in the `fit` method of :class:`cluster.HDBSCAN`.
:pr:`27678` by :user:`Ganesh Tata <tataganesh>`.

- |Fix| Create copy of precomputed sparse matrix within the
`fit` method of `cluster.DBSCAN` to avoid in-place modification of
the sparse matrix.
:pr:`27651` by :user:`Ganesh Tata <tataganesh>`.
Expand Down
26 changes: 17 additions & 9 deletions sklearn/cluster/_hdbscan/hdbscan.py
Original file line number Diff line number Diff line change
Expand Up @@ -104,22 +104,30 @@ def _brute_mst(mutual_reachability, min_samples):
if not issparse(mutual_reachability):
return mst_from_mutual_reachability(mutual_reachability)

# Check connected component on mutual reachability
# If more than one component, it means that even if the distance matrix X
# has one component, there exists with less than `min_samples` neighbors
if (
csgraph.connected_components(
mutual_reachability, directed=False, return_labels=False
)
> 1
):
# Check if the mutual reachability matrix has any rows which have
# less than `min_samples` non-zero elements.
indptr = mutual_reachability.indptr
num_points = mutual_reachability.shape[0]
if any((indptr[i + 1] - indptr[i]) < min_samples for i in range(num_points)):
raise ValueError(
f"There exists points with fewer than {min_samples} neighbors. Ensure"
" your distance matrix has non-zero values for at least"
f" `min_sample`={min_samples} neighbors for each points (i.e. K-nn"
" graph), or specify a `max_distance` in `metric_params` to use when"
" distances are missing."
)
# Check connected component on mutual reachability.
# If more than one connected component is present,
# it means that the graph is disconnected.
n_components = csgraph.connected_components(
mutual_reachability, directed=False, return_labels=False
)
if n_components > 1:
raise ValueError(
f"Sparse mutual reachability matrix has {n_components} connected"
" components. HDBSCAN cannot be perfomed on a disconnected graph. Ensure"
" that the sparse distance matrix has only one connected component."
)

# Compute the minimum spanning tree for the sparse graph
sparse_min_spanning_tree = csgraph.minimum_spanning_tree(mutual_reachability)
Expand Down
17 changes: 17 additions & 0 deletions sklearn/cluster/tests/test_hdbscan.py
Original file line number Diff line number Diff line change
Expand Up @@ -401,6 +401,23 @@ def test_hdbscan_sparse_distances_too_few_nonzero(csr_container):
HDBSCAN(metric="precomputed").fit(X)


@pytest.mark.parametrize("csr_container", CSR_CONTAINERS)
def test_hdbscan_sparse_distances_disconnected_graph(csr_container):
"""
Tests that HDBSCAN raises the correct error when the distance matrix
has multiple connected components.
"""
# Create symmetric sparse matrix with 2 connected components
X = np.zeros((20, 20))
X[:5, :5] = 1
X[5:, 15:] = 1
X = X + X.T
X = csr_container(X)
msg = "HDBSCAN cannot be perfomed on a disconnected graph"
with pytest.raises(ValueError, match=msg):
HDBSCAN(metric="precomputed").fit(X)


def test_hdbscan_tree_invalid_metric():
"""
Tests that HDBSCAN correctly raises an error for invalid metric choices.
Expand Down