[MRG+1] Fix regression in silhouette_score for clusters of size 1 #7438
Changes from all commits: 681f5b8, 5c66f0b, c7a2b95, cf4dcec, 3c850dd, 173a3a1, dbf5300
sklearn/metrics/cluster/tests/test_unsupervised.py

@@ -5,13 +5,15 @@
 from sklearn import datasets
 from sklearn.utils.testing import assert_false
 from sklearn.utils.testing import assert_almost_equal
+from sklearn.utils.testing import assert_array_equal
 from sklearn.utils.testing import assert_equal
 from sklearn.utils.testing import assert_raises_regexp
 from sklearn.utils.testing import assert_raise_message
 from sklearn.utils.testing import assert_greater
 from sklearn.metrics.cluster import silhouette_score
-from sklearn.metrics.cluster import calinski_harabaz_score
+from sklearn.metrics.cluster import silhouette_samples
 from sklearn.metrics import pairwise_distances
+from sklearn.metrics.cluster import calinski_harabaz_score


 def test_silhouette():
@@ -56,16 +58,27 @@ def test_silhouette():
     assert_almost_equal(score_euclidean, score_dense_with_sampling)


-def test_no_nan():
-    # Assert Silhouette Coefficient != nan when there is 1 sample in a class.
-    # This tests for the condition that caused issue 960.
-    # Note that there is only one sample in cluster 0. This used to cause the
-    # silhouette_score to return nan (see bug #960).
-    labels = np.array([1, 0, 1, 1, 1])
-    # The distance matrix doesn't actually matter.
-    D = np.random.RandomState(0).rand(len(labels), len(labels))
-    silhouette = silhouette_score(D, labels, metric='precomputed')
+def test_cluster_size_1():
+    # Assert Silhouette Coefficient == 0 when there is 1 sample in a cluster
+    # (cluster 0). We also test the case where there are identical samples
+    # as the only members of a cluster (cluster 2). To our knowledge, this case
+    # is not discussed in reference material, and we choose for it a sample
+    # score of 1.
+    X = [[0.], [1.], [1.], [2.], [3.], [3.]]
+    labels = np.array([0, 1, 1, 1, 2, 2])
Review comment: Actually please update this test to have non-contiguous labels that do not start at zero, for instance: labels = np.array([1, 3, 3, 3, 4, 4])

Reply: I think that would be confusing here. Indeed there's already a test_non_encoded_labels, and if I add a *2 in it, it's gappy. Actually, it was only testing …
+
+    # Cluster 0: 1 sample -> score of 0 by Rousseeuw's convention
+    # Cluster 1: intra-cluster = [.5, .5, 1]
+    #            inter-cluster = [1, 1, 1]
+    #            silhouette = [.5, .5, 0]
+    # Cluster 2: intra-cluster = [0, 0]
+    #            inter-cluster = [arbitrary, arbitrary]
+    #            silhouette = [1., 1.]
Review comment: Please rename the cluster ids (1, 2, 3) in this comment to match …
+
+    silhouette = silhouette_score(X, labels)
     assert_false(np.isnan(silhouette))
+    ss = silhouette_samples(X, labels)
+    assert_array_equal(ss, [0, .5, .5, 0, 1, 1])
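The expected values in the new assert_array_equal can be checked by hand. Here is a minimal sketch of Rousseeuw's silhouette formula applied to the test data; the helper name silhouette_by_hand is ours for illustration, not part of scikit-learn:

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Reference computation of the silhouette per sample (sketch, 1-D data)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise distances between 1-D points.
    D = np.abs(X[:, None, 0] - X[None, :, 0])
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        n_same = same.sum() - 1
        if n_same == 0:
            # Singleton cluster: score 0 by Rousseeuw's convention.
            scores[i] = 0.0
            continue
        # a(i): mean distance to the other members of i's cluster.
        a = D[i, same].sum() / n_same
        # b(i): smallest mean distance to any other cluster.
        b = min(D[i, labels == l].mean()
                for l in set(labels.tolist()) if l != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X = [[0.], [1.], [1.], [2.], [3.], [3.]]
labels = [0, 1, 1, 1, 2, 2]
print(silhouette_by_hand(X, labels))  # matches [0, .5, .5, 0, 1, 1]
```

This reproduces the per-cluster arithmetic spelled out in the code comment above: cluster 1's samples score .5, .5, 0, and the duplicated samples in cluster 2 get a(i) = 0, hence a score of 1.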


 def test_correct_labelsize():
@@ -93,7 +106,9 @@ def test_non_encoded_labels():
     X = dataset.data
     labels = dataset.target
     assert_equal(
-        silhouette_score(X, labels + 10), silhouette_score(X, labels))
+        silhouette_score(X, labels * 2 + 10), silhouette_score(X, labels))
+    assert_array_equal(
+        silhouette_samples(X, labels * 2 + 10), silhouette_samples(X, labels))
Review comment: Great checks, thanks!


 def test_non_numpy_labels():
sklearn/metrics/cluster/unsupervised.py

@@ -9,6 +9,7 @@
 from ...utils import check_random_state
+from ...utils import check_X_y
 from ...utils.fixes import bincount
 from ..pairwise import pairwise_distances
 from ...preprocessing import LabelEncoder
@@ -88,15 +89,8 @@ def silhouette_score(X, labels, metric='euclidean', sample_size=None,
     <https://en.wikipedia.org/wiki/Silhouette_(clustering)>`_

     """
-    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
-    le = LabelEncoder()
-    labels = le.fit_transform(labels)
-    n_labels = len(le.classes_)
-    n_samples = X.shape[0]
-
-    check_number_of_labels(n_labels, n_samples)
-
     if sample_size is not None:
+        X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
         random_state = check_random_state(random_state)
         indices = random_state.permutation(X.shape[0])[:sample_size]
         if metric == "precomputed":
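The sample_size branch above scores only a random subset of the data. Its index selection can be sketched as follows; X and labels here are made-up stand-ins, not data from the PR:

```python
import numpy as np
from numpy.random import RandomState

# check_random_state(0) ultimately yields a seeded RandomState like this one.
rng = RandomState(0)

X = np.arange(20).reshape(10, 2)          # 10 hypothetical samples
labels = np.array([0] * 5 + [1] * 5)

# Draw sample_size=4 distinct row indices, then subsample X and labels together.
indices = rng.permutation(X.shape[0])[:4]
X_sub, labels_sub = X[indices], labels[indices]
print(X_sub.shape)  # (4, 2)
```

Subsampling rows and labels with the same index array keeps the pairing intact, which is why the precomputed-metric case below needs to slice both axes of the distance matrix instead.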
@@ -166,36 +160,39 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
     <https://en.wikipedia.org/wiki/Silhouette_(clustering)>`_

     """
+    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
     le = LabelEncoder()
     labels = le.fit_transform(labels)
     check_number_of_labels(len(le.classes_), X.shape[0])

     distances = pairwise_distances(X, metric=metric, **kwds)
     unique_labels = le.classes_
+    n_samples_per_label = bincount(labels, minlength=len(unique_labels))

     # For sample i, store the mean distance of the cluster to which
     # it belongs in intra_clust_dists[i]
-    intra_clust_dists = np.ones(distances.shape[0], dtype=distances.dtype)
+    intra_clust_dists = np.zeros(distances.shape[0], dtype=distances.dtype)

     # For sample i, store the mean distance of the second closest
     # cluster in inter_clust_dists[i]
-    inter_clust_dists = np.inf * intra_clust_dists
+    inter_clust_dists = np.inf + intra_clust_dists

-    for curr_label in unique_labels:
+    for curr_label in range(len(unique_labels)):

         # Find inter_clust_dist for all samples belonging to the same
         # label.
         mask = labels == curr_label
Review comment: This is now broken if labels are not contiguous integers starting at zero, no?

Reply: Yes, it was broken previously :/
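The LabelEncoder.fit_transform call above is what makes iterating over range(len(unique_labels)) safe: it maps arbitrary label values onto the contiguous integers 0..n_classes-1. A sketch using NumPy alone, with np.unique(..., return_inverse=True) standing in for the encoder:

```python
import numpy as np

# The reviewer's non-contiguous labels, which do not start at zero.
labels = np.array([1, 3, 3, 3, 4, 4])

# 'classes' plays the role of le.classes_ (sorted unique values);
# 'encoded' plays the role of the transformed labels.
classes, encoded = np.unique(labels, return_inverse=True)
print(classes)   # [1 3 4]
print(encoded)   # [0 1 1 1 2 2]
```

After this mapping, masks like `labels == curr_label` for curr_label in range(3) hit every cluster, whereas comparing the raw values against 0..2 would have missed clusters 3 and 4 entirely.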
         current_distances = distances[mask]

         # Leave out current sample.
-        n_samples_curr_lab = np.sum(mask) - 1
+        n_samples_curr_lab = n_samples_per_label[curr_label] - 1
         if n_samples_curr_lab != 0:
             intra_clust_dists[mask] = np.sum(
                 current_distances[:, mask], axis=1) / n_samples_curr_lab

         # Now iterate over all other labels, finding the mean
         # cluster distance that is closest to every sample.
-        for other_label in unique_labels:
+        for other_label in range(len(unique_labels)):
             if other_label != curr_label:
                 other_mask = labels == other_label
                 other_distances = np.mean(

@@ -205,6 +202,8 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):

     sil_samples = inter_clust_dists - intra_clust_dists
     sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
Review comment: This will still raise a division-by-zero warning if all samples are duplicates of one another and all belong to the same cluster. But maybe we don't care that much about such a rare edge case...

Reply: Also if they belong to different clusters. Quick example: silhouette_samples([[1.0], [1.0], [1.0], [1.0]], [0, 0, 1, 1]). However, according to the formula, this might be expected behaviour.

Reply: If all samples belong to one cluster, …

Reply: Fair enough then.
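The edge case from this thread can be reproduced without scikit-learn. With four identical points split into two clusters, every sample's mean intra-cluster distance a(i) and mean inter-cluster distance b(i) are both 0, so the division above hits 0/0 and yields nan. A sketch of just that arithmetic, not the library code:

```python
import numpy as np

# Four identical points, two clusters of two: all pairwise distances are 0,
# so both a(i) and b(i) are 0 for every sample.
intra_clust_dists = np.zeros(4)
inter_clust_dists = np.zeros(4)

with np.errstate(invalid='ignore'):  # silence the 0/0 RuntimeWarning
    sil_samples = ((inter_clust_dists - intra_clust_dists)
                   / np.maximum(intra_clust_dists, inter_clust_dists))
print(sil_samples)  # all nan
```

Whether nan is acceptable here is exactly the judgment call in the thread: it comes straight out of the formula, and the case is degenerate (no distance information at all).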
+    # score 0 for clusters of size 1, according to the paper
+    sil_samples[n_samples_per_label.take(labels) == 1] = 0
     return sil_samples
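The new masking line relies on ndarray.take to broadcast each cluster's size back into sample order, so that comparing against 1 selects exactly the members of singleton clusters. For instance, with the labels from the new test:

```python
import numpy as np

labels = np.array([0, 1, 1, 1, 2, 2])
n_samples_per_label = np.bincount(labels)        # cluster sizes: [1 3 2]

# take(labels) looks up each sample's cluster size, one entry per sample.
sizes_per_sample = n_samples_per_label.take(labels)
print(sizes_per_sample)        # [1 3 3 3 2 2]
print(sizes_per_sample == 1)   # only the first sample is in a size-1 cluster
```

The resulting boolean mask zeroes out only sample 0's score, matching the expected [0, .5, .5, 0, 1, 1] in test_cluster_size_1.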
Review comment: Why is having identical samples in a cluster a special case? It comes directly from the formula, right? Here, after fixing i, a(i) can be viewed as the mean distance to all other identical points, which is zero, and b(i) can be anything. So the formula reduces to b(i) / b(i), which is 1, no?

Reply: It's not a special case; it just represents a non-smooth part of the function, where you might have expected, for clustering evaluation, that duplicating a point and assigning it the same label should not affect the score. As a matter of implementation, I could have, even accidentally, handled the single-point cluster case in a way that assigned it 0, so it's important that we make a decision on it and test it.
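The reduction argued in this thread can be written out from the silhouette definition (a sketch in Rousseeuw's notation):

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
\qquad\text{with duplicates: } a(i) = 0,\; b(i) > 0
\;\Longrightarrow\;
s(i) = \frac{b(i) - 0}{\max\{0,\, b(i)\}} = \frac{b(i)}{b(i)} = 1.
```

So a score of 1 for identical co-clustered samples falls out of the formula whenever b(i) > 0; the only genuinely undefined point is a(i) = b(i) = 0, the 0/0 case discussed above.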