[MRG+1] Fix regression in silhouette_score for clusters of size 1 by jnothman · Pull Request #7438 · scikit-learn/scikit-learn · GitHub
Merged (7 commits, Sep 22, 2016)
8 changes: 8 additions & 0 deletions doc/whats_new.rst
@@ -467,6 +467,14 @@ Bug fixes
(`#7350 <https://github.com/scikit-learn/scikit-learn/pull/7350/>`_)
By `Russell Smith`_.

- Fix bug in :func:`metrics.silhouette_score` in which clusters of
size 1 were incorrectly scored. They should get a score of 0.
By `Joel Nothman`_.

- Fix bug in :func:`metrics.silhouette_samples` so that it now works with
arbitrary labels, not just those ranging from 0 to n_clusters - 1.
By `Joel Nothman`_.


API changes summary
-------------------
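As a quick sketch of the post-fix behaviour both changelog entries describe (the data here is made up for illustration; label values are deliberately gappy and non-zero-based):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Six 1-D points: cluster 10 is a singleton, cluster 40 holds two
# identical points. Labels need not range from 0 to n_clusters - 1.
X = np.array([[0.], [1.], [1.], [2.], [3.], [3.]])
labels = np.array([10, 30, 30, 30, 40, 40])

scores = silhouette_samples(X, labels)
# The singleton sample scores 0 after this fix; the identical
# samples forming the last cluster each score 1.
print(scores)
```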
37 changes: 26 additions & 11 deletions sklearn/metrics/cluster/tests/test_unsupervised.py
@@ -5,13 +5,15 @@
from sklearn import datasets
from sklearn.utils.testing import assert_false
from sklearn.utils.testing import assert_almost_equal
from sklearn.utils.testing import assert_array_equal
from sklearn.utils.testing import assert_equal
from sklearn.utils.testing import assert_raises_regexp
from sklearn.utils.testing import assert_raise_message
from sklearn.utils.testing import assert_greater
from sklearn.metrics.cluster import silhouette_score
from sklearn.metrics.cluster import calinski_harabaz_score
from sklearn.metrics.cluster import silhouette_samples
from sklearn.metrics import pairwise_distances
from sklearn.metrics.cluster import calinski_harabaz_score


def test_silhouette():
@@ -56,16 +58,27 @@ def test_silhouette():
assert_almost_equal(score_euclidean, score_dense_with_sampling)


def test_no_nan():
# Assert Silhouette Coefficient != nan when there is 1 sample in a class.
# This tests for the condition that caused issue 960.
# Note that there is only one sample in cluster 0. This used to cause the
# silhouette_score to return nan (see bug #960).
labels = np.array([1, 0, 1, 1, 1])
# The distance matrix doesn't actually matter.
D = np.random.RandomState(0).rand(len(labels), len(labels))
silhouette = silhouette_score(D, labels, metric='precomputed')
def test_cluster_size_1():
# Assert Silhouette Coefficient == 0 when there is 1 sample in a cluster
# (cluster 0). We also test the case where there are identical samples
# as the only members of a cluster (cluster 2). To our knowledge, this case
# is not discussed in reference material, and we choose for it a sample
# score of 1.
@MechCoder (Member) commented on Sep 21, 2016:

Why is having identical samples in a cluster a special case? It comes directly from the formula, right? Here

After fixing i, a(i) can be viewed as mean distance to all other identical points which is zero and b(i) can be anything. So the formula reduces to b(i) / b(i) which is 1, no?
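This reasoning can be checked by hand on a tiny made-up example (a sketch of the formula itself, not of the library code):

```python
import numpy as np

# Cluster 1 consists of two identical points; cluster 0 of two distinct ones.
X = np.array([0.0, 4.0, 9.0, 9.0])
labels = np.array([0, 0, 1, 1])

i = 2  # one of the two identical points
same = (labels == labels[i]) & (np.arange(len(X)) != i)
other = labels != labels[i]

a_i = np.mean(np.abs(X[same] - X[i]))   # 0.0: the other member coincides
b_i = np.mean(np.abs(X[other] - X[i]))  # mean distance to the other cluster
s_i = (b_i - a_i) / max(a_i, b_i)       # reduces to b(i) / b(i) = 1
print(a_i, b_i, s_i)
```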

jnothman (Member, Author) replied:

> Why is having identical samples in a cluster a special case?

It's not a special case, it just represents a non-smooth part of the function, where you might have expected, for clustering evaluation, that duplicating a point and assigning it the same label should not affect the score.

As a matter of implementation, I could have, even accidentally, handled the single-point cluster case in a way that assigned this 0, so it's important that we make a decision on it and test it.

X = [[0.], [1.], [1.], [2.], [3.], [3.]]
labels = np.array([0, 1, 1, 1, 2, 2])
A reviewer (Member) commented:

Actually please update this test to have non-contiguous labels that do not start at zero, for instance:

labels = np.array([1, 3, 3, 3, 4, 4])

jnothman (Member, Author) replied:

I think that would be confusing here. Indeed there's already a "test_non_encoded_labels", and if I add a *2 in it, it's gappy. Actually, it was only testing silhouette_score not silhouette_samples. Changed. Also removed the double encoding of labels from silhouette_score.


# Cluster 0: 1 sample -> score of 0 by Rousseeuw's convention
# Cluster 1: intra-cluster = [.5, .5, 1]
# inter-cluster = [1, 1, 1]
# silhouette = [.5, .5, 0]
# Cluster 2: intra-cluster = [0, 0]
# inter-cluster = [arbitrary, arbitrary]
# silhouette = [1., 1.]
A reviewer (Member) commented:

Please rename the cluster ids (1, 2, 3) in this comment to match 0, 1, and 2 used in the labels variable.


silhouette = silhouette_score(X, labels)
assert_false(np.isnan(silhouette))
ss = silhouette_samples(X, labels)
assert_array_equal(ss, [0, .5, .5, 0, 1, 1])


def test_correct_labelsize():
@@ -93,7 +106,9 @@ def test_non_encoded_labels():
X = dataset.data
labels = dataset.target
assert_equal(
silhouette_score(X, labels + 10), silhouette_score(X, labels))
silhouette_score(X, labels * 2 + 10), silhouette_score(X, labels))
assert_array_equal(
silhouette_samples(X, labels * 2 + 10), silhouette_samples(X, labels))
A reviewer (Member) commented:

Great checks, thanks!
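The invariance these checks exercise can also be sketched standalone on made-up data:

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.RandomState(0)
X = rng.rand(10, 2)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

# Relabelling (here to gappy, non-zero-based values) must change
# nothing: the scores depend only on the grouping, not on the label
# values themselves.
relabelled = labels * 2 + 10
assert np.allclose(silhouette_samples(X, labels),
                   silhouette_samples(X, relabelled))
assert np.isclose(silhouette_score(X, labels),
                  silhouette_score(X, relabelled))
```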



def test_non_numpy_labels():
25 changes: 12 additions & 13 deletions sklearn/metrics/cluster/unsupervised.py
@@ -9,6 +9,7 @@

from ...utils import check_random_state
from ...utils import check_X_y
from ...utils.fixes import bincount
from ..pairwise import pairwise_distances
from ...preprocessing import LabelEncoder

@@ -88,15 +89,8 @@ def silhouette_score(X, labels, metric='euclidean', sample_size=None,
<https://en.wikipedia.org/wiki/Silhouette_(clustering)>`_

"""
X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
le = LabelEncoder()
labels = le.fit_transform(labels)
n_labels = len(le.classes_)
n_samples = X.shape[0]

check_number_of_labels(n_labels, n_samples)

if sample_size is not None:
X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
random_state = check_random_state(random_state)
indices = random_state.permutation(X.shape[0])[:sample_size]
if metric == "precomputed":
@@ -166,36 +160,39 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
<https://en.wikipedia.org/wiki/Silhouette_(clustering)>`_

"""
X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
le = LabelEncoder()
labels = le.fit_transform(labels)
check_number_of_labels(len(le.classes_), X.shape[0])

distances = pairwise_distances(X, metric=metric, **kwds)
unique_labels = le.classes_
n_samples_per_label = bincount(labels, minlength=len(unique_labels))

# For sample i, store the mean distance of the cluster to which
# it belongs in intra_clust_dists[i]
intra_clust_dists = np.ones(distances.shape[0], dtype=distances.dtype)
intra_clust_dists = np.zeros(distances.shape[0], dtype=distances.dtype)

# For sample i, store the mean distance of the second closest
# cluster in inter_clust_dists[i]
inter_clust_dists = np.inf * intra_clust_dists
inter_clust_dists = np.inf + intra_clust_dists

for curr_label in unique_labels:
for curr_label in range(len(unique_labels)):

# Find inter_clust_dist for all samples belonging to the same
# label.
mask = labels == curr_label
A reviewer (Member) commented:

This is now broken if labels are not contiguous integers starting at zero, no?

Another member replied:

Yes, it was broken previously :/

current_distances = distances[mask]

# Leave out current sample.
n_samples_curr_lab = np.sum(mask) - 1
n_samples_curr_lab = n_samples_per_label[curr_label] - 1
if n_samples_curr_lab != 0:
intra_clust_dists[mask] = np.sum(
current_distances[:, mask], axis=1) / n_samples_curr_lab

# Now iterate over all other labels, finding the mean
# cluster distance that is closest to every sample.
for other_label in unique_labels:
for other_label in range(len(unique_labels)):
if other_label != curr_label:
other_mask = labels == other_label
other_distances = np.mean(
@@ -205,6 +202,8 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):

sil_samples = inter_clust_dists - intra_clust_dists
sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
A reviewer (Member) commented:

This will still raise a division by zero warning if all samples are duplicate of one another and all belong to the same cluster. But maybe we don't care that much of such a rare edge case...

Another reviewer (Member) replied:

Also if they belong to different cluster. Quick example.

 silhouette_samples([[1.0], [1.0], [1.0], [1.0]], [0, 0, 1, 1])

However, according to the formula, this might be expected behaviour.
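Working that example through the formula by hand (a sketch independent of the library implementation) shows where the 0/0 arises:

```python
import numpy as np

X = np.array([[1.0], [1.0], [1.0], [1.0]])  # all samples identical
labels = np.array([0, 0, 1, 1])
D = np.abs(X - X.T)  # pairwise distances, all zero here

n = len(X)
scores = []
for i in range(n):
    same = (labels == labels[i]) & (np.arange(n) != i)
    other = labels != labels[i]
    a_i = D[i, same].mean()   # 0.0: all points coincide
    b_i = D[i, other].mean()  # 0.0 as well
    denom = max(a_i, b_i)     # 0.0, so the division below is 0/0
    with np.errstate(invalid="ignore"):
        scores.append((b_i - a_i) / denom)
print(scores)  # each entry is nan
```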

jnothman (Member, Author) replied:

If all samples belong to one cluster, silhouette_score is undefined according to the paper. All points being identical is not covered in the paper. Understandably.

A reviewer (Member) replied:

Fair enough then.

# score 0 for clusters of size 1, according to the paper
sil_samples[n_samples_per_label.take(labels) == 1] = 0
return sil_samples
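The new masking line can be recreated standalone to show how `take` broadcasts per-cluster sizes back to sample positions (encoded labels below are made up for illustration):

```python
import numpy as np

# Encoded labels for six samples; cluster 0 has a single member.
labels = np.array([0, 1, 1, 1, 2, 2])
n_samples_per_label = np.bincount(labels)         # sizes per cluster: 1, 3, 2

# take() maps each sample's label to the size of its cluster, so the
# mask below is True exactly where a sample sits alone in its cluster.
cluster_sizes = n_samples_per_label.take(labels)  # 1, 3, 3, 3, 2, 2

sil_samples = np.array([np.nan, .5, .5, 0., 1., 1.])
sil_samples[cluster_sizes == 1] = 0  # singleton clusters score 0
print(sil_samples)
```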

