@@ -1339,6 +1339,91 @@ mean of homogeneity and completeness**:
   <http://www.cs.columbia.edu/~hila/hila-thesis-distributed.pdf>`_, Hila
   Becker, PhD Thesis.
+.. _fowlkes_mallows_scores:
+
+Fowlkes-Mallows scores
+----------------------
+
+The Fowlkes-Mallows index (FMI) is defined as the geometric mean of
+the pairwise precision and recall::
+
+    FMI = TP / sqrt((TP + FP) * (TP + FN))
+
+where :math:`TP` is the number of **True Positives** (i.e. the number of pairs
+of points that belong to the same cluster in both ``labels_true`` and
+``labels_pred``), :math:`FP` is the number of **False Positives** (i.e. the
+number of pairs of points that belong to the same cluster in ``labels_true``
+but not in ``labels_pred``) and :math:`FN` is the number of **False Negatives**
+(i.e. the number of pairs of points that belong to the same cluster in
+``labels_pred`` but not in ``labels_true``).
+
+The score ranges from 0 to 1. A high value indicates a good similarity
+between the two clusterings.
+
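The pair counting above can be spelled out directly. The sketch below is a minimal illustration of the definition, not the library's implementation; the helper name ``fmi_from_pairs`` is ours:

```python
from itertools import combinations
from math import sqrt

def fmi_from_pairs(labels_true, labels_pred):
    # Count over all pairs of points: TP = pair grouped together in both
    # labelings; FP = grouped together only in labels_true; FN = grouped
    # together only in labels_pred (following the definitions above).
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif same_true:
            fp += 1
        elif same_pred:
            fn += 1
    return tp / sqrt((tp + fp) * (tp + fn))

# On the example used below: TP = 2, FP = 4, FN = 1, so 2 / sqrt(6 * 3)
print(fmi_from_pairs([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))
```

Because the FMI is symmetric in :math:`FP` and :math:`FN`, swapping ``labels_true`` and ``labels_pred`` leaves the score unchanged.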
+  >>> from sklearn import metrics
+  >>> labels_true = [0, 0, 0, 1, 1, 1]
+  >>> labels_pred = [0, 0, 1, 1, 2, 2]
+
+  >>> metrics.fowlkes_mallows_score(labels_true, labels_pred)  # doctest: +ELLIPSIS
+  0.47140...
+
+One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get
+the same score::
+
+  >>> labels_pred = [1, 1, 0, 0, 3, 3]
+
+  >>> metrics.fowlkes_mallows_score(labels_true, labels_pred)  # doctest: +ELLIPSIS
+  0.47140...
+
+Perfect labeling is scored 1.0::
+
+  >>> labels_pred = labels_true[:]
+  >>> metrics.fowlkes_mallows_score(labels_true, labels_pred)  # doctest: +ELLIPSIS
+  1.0
+
+Poorly agreeing labelings (e.g. independent labelings) have zero scores::
+
+  >>> labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
+  >>> labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
+  >>> metrics.fowlkes_mallows_score(labels_true, labels_pred)  # doctest: +ELLIPSIS
+  0.0
+
+Advantages
+~~~~~~~~~~
+
+- **Random (uniform) label assignments have an FMI score close to 0.0**
+  for any value of ``n_clusters`` and ``n_samples`` (which is not the
+  case for raw Mutual Information or the V-measure for instance).
+
+- **Bounded range [0, 1]**: Values close to zero indicate two label
+  assignments that are largely independent, while values close to one
+  indicate significant agreement. Further, a value of exactly 0 indicates
+  **purely** independent label assignments and an FMI of exactly 1 indicates
+  that the two label assignments are equal (with or without permutation).
+
+- **No assumption is made on the cluster structure**: the measure can be used
+  to compare clustering algorithms such as k-means, which assumes isotropic
+  blob shapes, with results of spectral clustering algorithms, which can
+  find clusters with "folded" shapes.
+
+
+Drawbacks
+~~~~~~~~~
+
+- Contrary to inertia, **FMI-based measures require the knowledge
+  of the ground truth classes**, which is almost never available in practice
+  or requires manual assignment by human annotators (as in the supervised
+  learning setting).
+
+.. topic:: References
+
+  * E. B. Fowlkes and C. L. Mallows, 1983. "A method for comparing two
+    hierarchical clusterings". Journal of the American Statistical
+    Association. http://wildfire.stat.ucla.edu/pdflibrary/fowlkes.pdf
+
+  * `Wikipedia entry for the Fowlkes-Mallows Index
+    <https://en.wikipedia.org/wiki/Fowlkes-Mallows_index>`_
+
 .. _silhouette_coefficient:

 Silhouette Coefficient
@@ -1413,3 +1498,70 @@ Drawbacks
 * :ref:`example_cluster_plot_kmeans_silhouette_analysis.py`: In this example
   the silhouette analysis is used to choose an optimal value for n_clusters.
+
+.. _calinski_harabaz_index:
+
+Calinski-Harabaz Index
+----------------------
+
+If the ground truth labels are not known, the Calinski-Harabaz index
+(:func:`sklearn.metrics.calinski_harabaz_score`) can be used to evaluate the
+model, where a higher Calinski-Harabaz score relates to a model with better
+defined clusters.
+
+
1512
+ For :math: `k` clusters, the Calinski-Harabaz :math: `ch` is given as the ratio
1513
+ of the between-clusters dispersion mean and the within-cluster dispersion:
1514
+
1515
+ .. math ::
1516
+ ch(k) = \frac {trace(B_k)}{trace(W_k)} \times \frac {N - k}{k - 1 }
1517
+ W_k = \sum _{q=1 }^k \sum _{x \in C_q} (x - c_q) (x - c_q)^T \\
1518
+ B_k = \sum _q n_q (c_q - c) (c_q - c)^T \\
1519
+
1520
+ where:
1521
+ - :math: `N` be the number of points in our data,
1522
+ - :math: `C_q` be the set of points in cluster :math: `q`,
1523
+ - :math: `c_q` be the center of cluster :math: `q`,
1524
+ - :math: `c` be the center of :math: `E`,
1525
+ - :math: `n_q` be the number of points in cluster :math: `q`:
1526
+
1527
+
1528
+ >>> from sklearn import metrics
1529
+ >>> from sklearn.metrics import pairwise_distances
1530
+ >>> from sklearn import datasets
1531
+ >>> dataset = datasets.load_iris()
1532
+ >>> X = dataset.data
1533
+ >>> y = dataset.target
1534
+
1535
+ In normal usage, the Calinski-Harabaz index is applied to the results of a
1536
+ cluster analysis.
1537
+
1538
+ >>> import numpy as np
1539
+ >>> from sklearn.cluster import KMeans
1540
+ >>> kmeans_model = KMeans(n_clusters = 3 , random_state = 1 ).fit(X)
1541
+ >>> labels = kmeans_model.labels_
1542
+ >>> metrics.calinski_harabaz_score(X, labels)
1543
+ ... # doctest: +ELLIPSIS
1544
+ 560.39...
1545
+
1546
+
+Advantages
+~~~~~~~~~~
+
+- The score is higher when clusters are dense and well separated, which
+  relates to a standard concept of a cluster.
+
+- The score is fast to compute.
+
+
+Drawbacks
+~~~~~~~~~
+
+- The Calinski-Harabaz index is generally higher for convex clusters than for
+  other concepts of clusters, such as density-based clusters like those
+  obtained through DBSCAN.
+
+.. topic:: References
+
+ * Caliński, T., & Harabasz, J. (1974). "A dendrite method for cluster
+   analysis". Communications in Statistics - Theory and Methods 3: 1-27.
+   `doi:10.1080/03610926.2011.560741 <http://dx.doi.org/10.1080/03610926.2011.560741>`_.