8000 Remove inertia and distortion and update doc. · scikit-learn/scikit-learn@265fc71 · GitHub
[go: up one dir, main page]

Skip to content

Commit 265fc71

Browse files
committed
Remove inertia and distortion and update doc.
1 parent 5eb32bf commit 265fc71

File tree

5 files changed

+141
-145
lines changed

5 files changed

+141
-145
lines changed

doc/modules/clustering.rst

Lines changed: 138 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1339,6 +1339,75 @@ mean of homogeneity and completeness**:
13391339
<http://www.cs.columbia.edu/~hila/hila-thesis-distributed.pdf>`_, Hila
13401340
Becker, PhD Thesis.
13411341
1342+
.. _fowlkes_mallows_scores:
1343+
1344+
Fowlkes-Mallows scores
1345+
----------------------
1346+
1347+
>>> from sklearn import metrics
1348+
>>> labels_true = [0, 0, 0, 1, 1, 1]
1349+
>>> labels_pred = [0, 0, 1, 1, 2, 2]
1350+
1351+
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred) # doctest: +ELLIPSIS
1352+
0.47140...
1353+
1354+
One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get
1355+
the same score::
1356+
1357+
>>> labels_pred = [1, 1, 0, 0, 3, 3]
1358+
1359+
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred) # doctest: +ELLIPSIS
1360+
0.47140...
1361+
1362+
Perfect labeling is scored 1.0::
1363+
1364+
>>> labels_pred = labels_true[:]
1365+
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred) # doctest: +ELLIPSIS
1366+
1.0
1367+
1368+
Bad (e.g. independent labelings) have zero scores::
1369+
1370+
>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
1371+
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
1372+
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred) # doctest: +ELLIPSIS
1373+
0.0
1374+
1375+
Advantages
1376+
~~~~~~~~~~
1377+
1378+
- **Random (uniform) label assignments have a FMI score close to 0.0**
1379+
for any value of ``n_clusters`` and ``n_samples`` (which is not the
1380+
case for raw Mutual Information or the V-measure for instance).
1381+
1382+
- **Bounded range [0, 1]**: Values close to zero indicate two label
1383+
assignments that are largely independent, while values close to one
1384+
indicate significant agreement. Further, values of exactly 0 indicate
1385+
**purely** independent label assignments and a AMI of exactly 1 indicates
1386+
that the two label assignments are equal (with or without permutation).
1387+
1388+
- **No assumption is made on the cluster structure**: can be used
1389+
to compare clustering algorithms such as k-means which assumes isotropic
1390+
blob shapes with results of spectral clustering algorithms which can
1391+
find cluster with "folded" shapes.
1392+
1393+
1394+
Drawbacks
1395+
~~~~~~~~~
1396+
1397+
- Contrary to inertia, **FMI-based measures require the knowledge
1398+
of the ground truth classes** while almost never available in practice or
1399+
requires manual assignment by human annotators (as in the supervised learning
1400+
setting).
1401+
1402+
.. topic:: References
1403+
1404+
* E. B. Fowkles and C. L. Mallows, 1983. "A method for comparing two
1405+
hierarchical clusterings". Journal of the American Statistical Association.
1406+
http://wildfire.stat.ucla.edu/pdflibrary/fowlkes.pdf
1407+
1408+
* `Wikipedia entry for the Fowlkes-Mallows Index
1409+
<https://en.wikipedia.org/wiki/Fowlkes-Mallows_index>`_
1410+
13421411
.. _silhouette_coefficient:
13431412

13441413
Silhouette Coefficient
@@ -1413,3 +1482,72 @@ Drawbacks
14131482

14141483
* :ref:`example_cluster_plot_kmeans_silhouette_analysis.py` : In this example
14151484
the silhouette analysis is used to choose an optimal value for n_clusters.
1485+
1486+
.. _calinski_harabaz_index:
1487+
1488+
Calinski-Harabaz Index
1489+
----------------------
1490+
1491+
If the ground truth labels are not known, the Calinski-Harabaz index
1492+
(:func:'sklearn.metrics.calinski_harabaz_score') can be used to evaluate the
1493+
model., where a higher Calinski-Harabaz score relates to a model with better
1494+
defined clusters.
1495+
1496+
For :math:`k` clusters, the Calinski-Harabaz :math:`ch` is given as the ratio
1497+
of the between-clusters dispersion mean and the within-cluster dispersion:
1498+
1499+
.. math::
1500+
ch(k) = \frac{trace(B_k)}{trace(W_k)} \times \frac{N - k}{k - 1}
1501+
W_k = \sum_{q=1}^k \sum_{x \in C_q} (x - c_q) (x - c_q)^T \\
1502+
B_k = \sum_q n_q (c_q - c) (c_q -c)^T \\
1503+
1504+
where:
1505+
- :math:`N` be the number of points in our data,
1506+
- :math:`C_q` be the set of points in cluster :math:`q`,
1507+
- :math:`c_q` be the center of cluster :math:`q`,
1508+
- :math:`c` be the center of :math:`E`,
1509+
- :math:`n_q` be the number of points in cluster :math:`q`:
1510+
1511+
1512+
>>> from sklearn import metrics
1513+
>>> from sklearn.metrics import pairwise_distances
1514+
>>> from sklearn import datasets
1515+
>>> dataset = datasets.load_iris()
1516+
>>> X = dataset.data
1517+
>>> y = dataset.target
1518+
1519+
In normal usage, the Calinski-Harabaz index is applied to the results of a
1520+
cluster analysis.
1521+
1522+
>>> import numpy as np
1523+
>>> from sklearn.cluster import KMeans
1524+
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
1525+
>>> labels = kmeans_model.labels_
1526+
>>> metrics.calinski_harabaz_score(X, labels)
1527+
... # doctest: +ELLIPSIS
1528+
560.39...
1529+
1530+
1531+
.. topic:: References
1532+
1533+
* Caliński, T., & Harabasz, J. (1974). "A dendrite method for cluster
1534+
analysis". Communications in Statistics-theory and Methods 3: 1-27.
1535+
`doi:10.1080/03610926.2011.560741 <http://dx.doi.org/10.1080/03610926.2011.560741>`_.
1536+
1537+
1538+
Advantages
1539+
~~~~~~~~~~
1540+
1541+
- The score is higher when clusters are dense and well separated, which relates
1542+
to a standard concept of a cluster.
1543+
1544+
- The score is fast to compute
1545+
1546+
1547+
Drawbacks
1548+
~~~~~~~~~
1549+
1550+
- The Calinski-Harabaz index is generally higher for convex clusters than other
1551+
concepts of clusters, such as density based clusters like those obtained
1552+
through DBSCAN.
1553+

sklearn/metrics/__init__.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,8 +39,10 @@
3939
from .cluster import homogeneity_score
4040
from .cluster import mutual_info_score
4141
from .cluster import normalized_mutual_info_score
42+
from .cluster import fowlkes_mallows_score
4243
from .cluster import silhouette_samples
4344
from .cluster import silhouette_score
45+
from .cluster import calinski_harabaz_score
4446
from .cluster import v_measure_score
4547

4648
from .pairwise import euclidean_distances

sklearn/metrics/cluster/__init__.py

Lines changed: 1 addition & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -20,14 +20,11 @@
2020
from .unsupervised import silhouette_samples
2121
from .unsupervised import silhouette_score
2222
from .unsupervised import calinski_harabaz_score
23-
from .unsupervised import distortion_score
24-
from .unsupervised import inertia_score
2523
from .bicluster import consensus_score
2624

2725
__all__ = ["adjusted_mutual_info_score", "normalized_mutual_info_score",
2826
"adjusted_rand_score", "completeness_score", "contingency_matrix",
2927
"expected_mutual_information", "homogeneity_completeness_v_measure",
3028
"homogeneity_score", "mutual_info_score", "v_measure_score",
3129
"fowlkes_mallows_score", "entropy", "silhouette_samples",
32-
"silhouette_score", "calinski_harabaz_score", "distortion_score",
33-
"inertia_score", "consensus_score"]
30+
"silhouette_score", "calinski_harabaz_score", "consensus_score"]

sklearn/metrics/cluster/tests/test_unsupervised.py

Lines changed: 0 additions & 40 deletions
150
Original file line numberDiff line numberDiff line change
@@ -9,8 +9,6 @@
99
from sklearn.utils.testing import assert_raise_message
1010
from sklearn.metrics.cluster import silhouette_score
1111
from sklearn.metrics.cluster import calinski_harabaz_score
12-
from sklearn.metrics.cluster import distortion_score
13-
from sklearn.metrics.cluster import inertia_score
1412
from sklearn.metrics import pairwise_distances
1513

1614

@@ -119,41 +117,3 @@ def test_calinski_harabaz_score():
119117
labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
120118
assert_almost_equal(calinski_harabaz_score(X, labels),
121119
45 * (40 - 4) / (5 * (4 - 1)))
122-
123-
124-
def test_distortion_score():
125-
rng = np.random.RandomState(seed=0)
126-
127-
# Assert message when there is only one label
128-
assert_raise_message(ValueError, "Number of labels is",
129-
distortion_score,
130-
rng.rand(10, 2), np.zeros(10))
131-
132-
# Assert message when all point are in different clusters
133-
assert_raise_message(ValueError, "Number of labels is",
134-
distortion_score,
135-
rng.rand(10, 2), np.arange(10))
136-
137-
X = np.array([[0, 0], [2, 2],
138-
[5, 5], [6, 6]])
139-
labels = [0, 0, 1, 1]
140-
assert_almost_equal(distortion_score(X, labels), 1.5 * np.sqrt(2.))
141-
142-
143-
def test_inertia_score():
144-
rng = np.random.RandomState(seed=0)
145-
146-
# Assert message when there is only one label
147-
assert_raise_message(ValueError, "Number of labels is",
148-
inertia_score,
149-
rng.rand(10, 2), np.zeros(10))
-
151-
# Assert message when all point are in different clusters
152-
assert_raise_message(ValueError, "Number of labels is",
153-
inertia_score,
154-
rng.rand(10, 2), np.arange(10))
155-
156-
X = np.array([[0, 0], [2, 2],
157-
[5, 5], [6, 6]])
158-
labels = [0, 0, 1, 1]
159-
assert_almost_equal(inertia_score(X, labels), 1.5 / np.sqrt(2.))

sklearn/metrics/cluster/unsupervised.py

Lines changed: 0 additions & 101 deletions
Original file line numberDiff line numberDiff line change
@@ -222,8 +222,6 @@ def calinski_harabaz_score(X, labels):
222222
B_K = \sum_k n_k (c_k - c) (c_k -c)^T
223223
W_K = \sum_k \sum_{x \in C_k} (x - c_k) (x - c_k)^T
224224
225-
The score ranges from 0 to 1.
226-
227225
Parameter
228226
---------
229227
X : array-like, shape (n_samples, n_features)
@@ -264,102 +262,3 @@ def calinski_harabaz_score(X, labels):
264262
return (1. if intra_disp == 0. else
265263
extra_disp * (n_samples - n_labels) /
266264
(intra_disp * (n_labels - 1.)))
267-
268-
269-
def distortion_score(X, labels, metric="euclidean", **kwds):
270-
"""Compute the distortion of a given dataset and their cluster assignment.
271-
272-
Parameters
273-
----------
274-
X : array-like, shape (n_samples, n_features)
275-
List of n_features-dimensional data points. Each row corresponds
276-
to a single data point.
277-
278-
labels : array-like, shape (n_samples,)
279-
Predicted labels for each sample.
280-
281-
metric : string, or callable
282-
The metric to use when calculating distance between instances in a
283-
feature array. If metric is a string, it must be one of the options
284-
allowed by :func:`sklearn.metrics.pairwise.pairwise_distances`.
285-
286-
**kwds : optional keyword parameters
287-
Any further parameters are passed directly to the distance function.
288-
If using a scipy.spatial.distance metric, the parameters are still
289-
metric dependent. See the scipy docs for usage examples.
290-
291-
Returns
292-
-------
293-
score: float
294-
The resulting distortion value.
295-
"""
296-
X, labels = check_X_y(X, labels)
297-
le = LabelEncoder()
298-
labels = le.fit_transform(labels)
299-
300-
n_samples, n_features = X.shape
301-
n_labels = len(le.classes_)
302-
303-
check_number_of_labels(n_labels, n_samples)
304-
305-
dist = 0.
306-
for k in range(n_labels):
307-
cluster_k = X[labels == k]
308-
mean_k = np.mean(cluster_k, axis=0)
309-
dist += np.sum(pairwise_distances(cluster_k,
310-
mean_k[:, np.newaxis].reshape(1, -1),
311-
metric=metric, **kwds))
312-
313-
return dist / n_features
314-
315-
316-
def inertia_score(X, labels, metric="euclidean", **kwds):
317-
"""Compute the inertia of a given dataset and their cluster assignment.
318-
319-
The inertia is defined as the sum of distances of samples to their closest
320-
cluster center.
321-
322-
Parameters
323-
----------
324-
X : array-like, shape (n_samples, n_features)
325-
List of n_features-dimensional data points. Each row corresponds
326-
to a single data point.
327-
328-
labels : array-like, shape (n_samples,)
329-
Predicted labels for each sample.
330-
331-
metric : string, or callable
332-
The metric to use when calculating distance between instances in a
333-
feature array. If metric is a string, it must be one of the options
334-
allowed by :func:`sklearn.metrics.pairwise.pairwise_distances`.
335-
336-
**kwds : optional keyword parameters
337-
Any further parameters are passed directly to the distance function.
338-
If using a scipy.spatial.distance metric, the parameters are still
339-
metric dependent. See the scipy docs for usage examples.
340-
341-
Returns
342-
-------
343-
score: float
344-
The resulting inertia value.
345-
"""
346-
X, labels = check_X_y(X, labels)
347-
le = LabelEncoder()
348-
labels = le.fit_transform(labels)
349-
350-
n_samples, n_features = X.shape
351-
n_labels = len(le.classes_)
352-
353-
if not 1 < n_labels < n_samples:
354-
raise ValueError("Number of labels is %d. Valid values are 2 "
355-
"to n_samples - 1 (inclusive)" % n_labels)
356-
357-
inertia = 0.
358-
for k in range(n_labels):
359-
cluster_k = X[labels == k]
360-
mean_k = np.mean(cluster_k, axis=0)
361-
inertia += (np.sum(pairwise_distances(
362-
cluster_k, mean_k[:, np.newaxis].reshape(1, -1),
363-
metric=metric, **kwds)) / (2 * len(cluster_k)))
364-
365-
return inertia

0 commit comments

Comments
 (0)
0