8000 [MRG] Introduction of some new cluster metrics. · scikit-learn/scikit-learn@8750c9c · GitHub
[go: up one dir, main page]

Skip to content
8000

Commit 8750c9c

Browse files
committed
[MRG] Introduction of some new cluster metrics.
This work is based on the code of A. Fourchet proposes in the PR-4301. - Speedup the computation. - Simplify the doc. - Correct some bugs. - Add test.
1 parent 3da1c49 commit 8750c9c

23 files changed

+390
-964
lines changed

doc/modules/clustering.rst

Lines changed: 152 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1339,6 +1339,91 @@ mean of homogeneity and completeness**:
13391339
<http://www.cs.columbia.edu/~hila/hila-thesis-distributed.pdf>`_, Hila
13401340
Becker, PhD Thesis.
13411341
1342+
.. _fowlkes_mallows_scores:
1343+
1344+
Fowlkes-Mallows scores
1345+
----------------------
1346+
1347+
The Fowlkes-Mallows index (FMI) is defined as the geometric mean between of
1348+
the precision and recall::
1349+
1350+
FMI = TP / sqrt((TP + FP) * (TP + FN))
1351+
1352+
Where :math:`TP` is the number of **True Positive** (i.e. the number of pair
1353+
of points that belongs in the same clusters in both labels_true and
1354+
labels_pred), :math:`FP` is the number of **False Positive** (i.e. the number
1355+
of pair of points that belongs in the same clusters in labels_true and not
1356+
in labels_pred) and :math:`FN`is the number of **False Negative** (i.e the
1357+
number of pair of points that belongs in the same clusters in labels_pred
1358+
and not in labels_True).
1359+
1360+
The score ranges from 0 to 1. A high value indicates a good similarity
1361+
between two clusters.
1362+
1363+
>>> from sklearn import metrics
1364+
>>> labels_true = [0, 0, 0, 1, 1, 1]
1365+
>>> labels_pred = [0, 0, 1, 1, 2, 2]
1366+
1367+
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred) # doctest: +ELLIPSIS
1368+
0.47140...
1369+
1370+
One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get
1371+
the same score::
1372+
1373+
>>> labels_pred = [1, 1, 0, 0, 3, 3]
1374+
1375+
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred) # doctest: +ELLIPSIS
1376+
0.47140...
1377+
1378+
Perfect labeling is scored 1.0::
1379+
1380+
>>> labels_pred = labels_true[:]
1381+
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred) # doctest: +ELLIPSIS
1382+
1.0
1383+
1384+
Bad (e.g. independent labelings) have zero scores::
1385+
1386+
>>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
1387+
>>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
1388+
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred) # doctest: +ELLIPSIS
1389+
0.0
1390+
1391+
Advantages
1392+
~~~~~~~~~~
1393+
1394+
- **Random (uniform) label assignments have a FMI score close to 0.0**
1395+
for any value of ``n_clusters`` and ``n_samples`` (which is not the
1396+
case for raw Mutual Information or the V-measure for instance).
1397+
1398+
- **Bounded range [0, 1]**: Values close to zero indicate two label
1399+
assignments that are largely independent, while values close to one
1400+
indicate significant agreement. Further, values of exactly 0 indicate
1401+
**purely** independent label assignments and a AMI of exactly 1 indicates
1402+
that the two label assignments are equal (with or without permutation).
1403+
1404+
- **No assumption is made on the cluster structure**: can be used
1405+
to compare clustering algorithms such as k-means which assumes isotropic
1406+
blob shapes with results of spectral clustering algorithms which can
1407+
find cluster with "folded" shapes.
1408+
1409+
1410+
Drawbacks
1411+
~~~~~~~~~
1412+
1413+
- Contrary to inertia, **FMI-based measures require the knowledge
1414+
of the ground truth classes** while almost never available in practice or
1415+
requires manual assignment by human annotators (as in the supervised learning
1416+
setting).
1417+
1418+
.. topic:: References
1419+
1420+
* E. B. Fowkles and C. L. Mallows, 1983. "A method for comparing two
1421+
hierarchical clusterings". Journal of the American Statistical Association.
1422+
http://wildfire.stat.ucla.edu/pdflibrary/fowlkes.pdf
1423+
1424+
* `Wikipedia entry for the Fowlkes-Mallows Index
1425+
<https://en.wikipedia.org/wiki/Fowlkes-Mallows_index>`_
1426+
13421427
.. _silhouette_coefficient:
13431428

13441429
Silhouette Coefficient
@@ -1413,3 +1498,70 @@ Drawbacks
14131498

14141499
* :ref:`example_cluster_plot_kmeans_silhouette_analysis.py` : In this example
14151500
the silhouette analysis is used to choose an optimal value for n_clusters.
1501+
1502+
.. _calinski_harabaz_index:
1503+
1504+
Calinski-Harabaz Index
1505+
----------------------
1506+
1507+
If the ground truth labels are not known, the Calinski-Harabaz index
1508+
(:func:'sklearn.metrics.calinski_harabaz_score') can be used to evaluate the
1509+
model, where a higher Calinski-Harabaz score relates to a model with better
1510+
defined clusters.
1511+
1512+
For :math:`k` clusters, the Calinski-Harabaz :math:`ch` is given as the ratio
1513+
of the between-clusters dispersion mean and the within-cluster dispersion:
1514+
1515+
.. math::
1516+
ch(k) = \frac{trace(B_k)}{trace(W_k)} \times \frac{N - k}{k - 1}
1517+
W_k = \sum_{q=1}^k \sum_{x \in C_q} (x - c_q) (x - c_q)^T \\
1518+
B_k = \sum_q n_q (c_q - c) (c_q - c)^T \\
1519+
1520+
where:
1521+
- :math:`N` be the number of points in our data,
1522+
- :math:`C_q` be the set of points in cluster :math:`q`,
1523+
- :math:`c_q` be the center of cluster :math:`q`,
1524+
- :math:`c` be the center of :math:`E`,
1525+
- :math:`n_q` be the number of points in cluster :math:`q`:
1526+
1527+
1528+
>>> from sklearn import metrics
1529+
>>> from sklearn.metrics import pairwise_distances
1530+
>>> from sklearn import datasets
1531+
>>> dataset = datasets.load_iris()
1532+
>>> X = dataset.data
1533+
>>> y = dataset.target
1534+
1535+
In normal usage, the Calinski-Harabaz index is applied to the results of a
1536+
cluster analysis.
1537+
1538+
>>> import numpy as np
1539+
>>> from sklearn.cluster import KMeans
1540+
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
1541+
>>> labels = kmeans_model.labels_
1542+
>>> metrics.calinski_harabaz_score(X, labels)
1543+
... # doctest: +ELLIPSIS
1544+
560.39...
1545+
1546+
1547+
Advantages
1548+
~~~~~~~~~~
1549+
1550+
- The score is higher when clusters are dense and well separated, which relates
1551+
to a standard concept of a cluster.
1552+
1553+
- The score is fast to compute
1554+
1555+
1556+
Drawbacks
1557+
~~~~~~~~~
1558+
1559+
- The Calinski-Harabaz index is generally higher for convex clusters than other
1560+
concepts of clusters, such as density based clusters like those obtained
1561+
through DBSCAN.
1562+
1563+
.. topic:: References
1564+
1565+
* Caliński, T., & Harabasz, J. (1974). "A dendrite method for cluster
1566+
analysis". Communications in Statistics-theory and Methods 3: 1-27.
1567+
`doi:10.1080/03610926.2011.560741 <http://dx.doi.org/10.1080/03610926.2011.560741>`_.

doc/whats_new.rst

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -58,6 +58,14 @@ New features
5858
One can pass method names such as `predict_proba` to be used in the cross
5959
validation framework instead of the default `predict`. By `Ori Ziv`_ and `Sears Merritt`_.
6060

61+
- Added :func:`metrics.cluster.fowlkes_mallows_score`, the Fowlkes Mallows
62+
Index which measures the similarity of two clusterings of a set of points
63+
By `Thierry Guillemot`_.
64+
65+
- Added :func:`metrics.calinski_harabaz_score`, which compute the Calinski
66+
and Harabaz score to evaluate the resulting clustering of a set of points.
67+
By `Thierry Guillemot`_.
68+
6169
Enhancements
6270
............
6371

examples/cluster/plot_choose_nb_cluster.py

Lines changed: 0 additions & 89 deletions
This file was deleted.

sklearn/metrics/__init__.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,8 +39,10 @@
3939
from .cluster import homogeneity_score
4040
from .cluster import mutual_info_score
4141
from .cluster import normalized_mutual_info_score
42+
from .cluster import fowlkes_mallows_score
4243
from .cluster import silhouette_samples
4344
from .cluster import silhouette_score
45+
from .cluster import calinski_harabaz_score
4446
from .cluster import v_measure_score
4547

4648
from .pairwise import euclidean_distances

sklearn/metrics/cluster/__init__.py

Lines changed: 4 additions & 2 deletions
< 7802 col width="40"/>
Original file line numberDiff line numberDiff line change
@@ -15,14 +15,16 @@
1515
from .supervised import homogeneity_score
1616
from .supervised import mutual_info_score
1717
from .supervised import v_measure_score
18+
from .supervised import fowlkes_mallows_score
1819
from .supervised import entropy
1920
from .unsupervised import silhouette_samples
2021
from .unsupervised import silhouette_score
22+
from .unsupervised import calinski_harabaz_score
2123
from .bicluster import consensus_score
2224

2325
__all__ = ["adjusted_mutual_info_score", "normalized_mutual_info_score",
2426
"adjusted_rand_score", "completeness_score", "contingency_matrix",
2527
"expected_mutual_information", "homogeneity_completeness_v_measure",
2628
"homogeneity_score", "mutual_info_score", "v_measure_score",
27-
"entropy", "silhouette_samples", "silhouette_score",
28-
"consensus_score"]
29+
"fowlkes_mallows_score", "entropy", "silhouette_samples",
30+
"silhouette_score", "calinski_harabaz_score", "consensus_score"]

sklearn/metrics/cluster/adjacency_matrix.py

Lines changed: 0 additions & 22 deletions
This file was deleted.

0 commit comments

Comments
 (0)
0