Commit a3d51f4 · scikit-learn/scikit-learn
add doc to choose number of clusters
Authored by Arnaud Fouchet (afouchet), committed by Arnaud Fouchet
1 parent c2a00f3

1 file changed: doc/modules/clustering.rst (241 additions, 0 deletions)

@@ -1300,3 +1300,244 @@ Drawbacks
concepts of clusters, such as density based clusters like those obtained
through DBSCAN.


Selecting the number of clusters
================================

.. figure:: ../auto_examples/cluster/images/plot_chosen_nb_cluster_comparaison.png
   :target: ../auto_examples/cluster/plot_chosen_nb_cluster_comparaison.html
   :align: center
   :scale: 50

   A comparison of algorithms to select the number of clusters in
   scikit-learn. The clustering algorithm used is spectral clustering.

.. currentmodule:: sklearn.metrics.cluster

Many clustering algorithms require the number of clusters to be specified. If
one does not know how many clusters to look for, there exist algorithms that
select the most relevant number of clusters, given the data and the clustering
algorithm used.

.. _unsupervised.silhouette_coefficient:

Silhouette Coefficient
----------------------

Presentation and usage
~~~~~~~~~~~~~~~~~~~~~~

.. _stability:

Stability
---------

Presentation
~~~~~~~~~~~~

A number of clusters is relevant if the clustering algorithm finds similar
results under small perturbations of the data. In this implementation, we run
the clustering algorithm on two large overlapping subsets of the data. If the
number of clusters is relevant, the data in both subsets should be clustered
in a similar way. Let :math:`E` be the initial data, and :math:`E_1` and
:math:`E_2` be subsets of :math:`E`, each containing :math:`80\%` of
:math:`E`. Let :math:`C_1` and :math:`C_2` be the clusterings obtained on
:math:`E_1` and :math:`E_2`, and let :math:`A_l` be the adjacency matrix of
clustering :math:`C_l`:

.. math:: \left(A_l\right)_{ij} = \left\{
    \begin{array}{ll}
    1 & \mbox{if } x_i \mbox{ and } x_j \mbox{ belong to the same cluster of } C_l \\
    0 & \mbox{otherwise}
    \end{array}
    \right.

On points belonging to both :math:`E_1` and :math:`E_2`, the adjacency
matrices should agree. We compute the similarity between the two clusterings
as:

.. math:: s(C_1, C_2) = \frac{
    \sum_{(x_i, x_j) \in E_1 \cap E_2}\left(A_1\right)_{ij}\left(A_2\right)_{ij}}
    {\sqrt{\sum_{(x_i, x_j) \in E_1 \cap E_2}\left(\left(A_1\right)_{ij}\right)^2}
    \sqrt{\sum_{(x_i, x_j) \in E_1 \cap E_2}\left(\left(A_2\right)_{ij}\right)^2}}

For each number of clusters :math:`k = 2 \dots k_{max}`, we repeat the
following :math:`N_{draws}` times:

- Select two subsets :math:`E_1` and :math:`E_2`

- Compute the clusterings :math:`C_1` and :math:`C_2`

- Compute the similarity :math:`s(C_1, C_2)`

The chosen number of clusters is the one with the maximum average similarity.
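
The following is a minimal NumPy sketch of a single draw of this procedure,
assuming :class:`~sklearn.cluster.KMeans` as the clustering algorithm; the
helper and variable names are only illustrative and are not part of the
scikit-learn API::

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    def adjacency(labels):
        # (A)_{ij} = 1 if points i and j share a cluster, 0 otherwise
        return (labels[:, None] == labels[None, :]).astype(float)

    data, _ = make_blobs(n_samples=200, centers=3, random_state=0)
    rng = np.random.RandomState(0)
    n_samples, k = data.shape[0], 3

    # two overlapping subsets, each containing 80% of the data
    subset_1 = np.sort(rng.choice(n_samples, int(0.8 * n_samples), replace=False))
    subset_2 = np.sort(rng.choice(n_samples, int(0.8 * n_samples), replace=False))
    common = np.intersect1d(subset_1, subset_2)

    labels_1 = KMeans(n_clusters=k, random_state=0).fit_predict(data[subset_1])
    labels_2 = KMeans(n_clusters=k, random_state=1).fit_predict(data[subset_2])

    # restrict both clusterings to the points present in both subsets
    A_1 = adjacency(labels_1[np.searchsorted(subset_1, common)])
    A_2 = adjacency(labels_2[np.searchsorted(subset_2, common)])

    # cosine similarity between the two adjacency matrices
    similarity = (A_1 * A_2).sum() / np.sqrt((A_1 ** 2).sum() * (A_2 ** 2).sum())

Averaging this similarity over many draws for each :math:`k`, and keeping the
:math:`k` with the highest average, gives the stability estimate.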

Usage
~~~~~

Given a dataset and a clustering algorithm (a :class:`ClusterMixin`),
stability returns the most stable number of clusters:

>>> from sklearn.datasets import make_blobs
>>> from sklearn.cluster import KMeans
>>> from sklearn.metrics.cluster.stability import stability
>>> data, labels = make_blobs(n_samples=1000, centers=2, random_state=0)
>>> kmeans_model = KMeans()
>>> stability(data, kmeans_model, k_max=10, verbose=False)
2

Advantages
~~~~~~~~~~

- Finds a number of clusters that is truly relevant to your data and your
  clustering algorithm

- The stability score, ranging from 0 to 1, measures how well your data is
  clustered into :math:`k` groups


Drawbacks
~~~~~~~~~

- Computational time

.. topic:: References

 * `"A stability based method for discovering structure in clustered data"
   <http://www.researchgate.net/profile/Asa_Ben-Hur/publication/11435997_A_stability_based_method_for_discovering_structure_in_clustered_data/links/00b4953988b6d0233d000000.pdf>`_
   Ben-Hur, A., Elisseeff, A., & Guyon, I., *Pacific Symposium on Biocomputing* (2001, December)

.. _gap_statistic:

Gap statistic
-------------

Presentation
~~~~~~~~~~~~

The gap statistic compares the inertia of our dataset :math:`E` in :math:`k`
clusters against the inertia of random data in :math:`k` clusters. By default,
the random data is drawn from a uniform distribution with the same bounds as
:math:`E`. It can also be drawn from a Gaussian distribution with the same
mean and variance as :math:`E`. Let :math:`W_k` be the inertia of the
randomly-drawn data in :math:`k` clusters, and :math:`W_k^*` be the inertia of
:math:`E` in :math:`k` clusters. The gap statistic is defined as:

.. math:: Gap(k) = \mathbb{E}\left[\log(W_k)\right] - \log(W_k^*)

If there are :math:`K` clusters in our data, we expect :math:`W_k^*` to
decrease fast for :math:`k \leq K` and slowly for :math:`k > K`. We estimate
:math:`\mathbb{E}\left[\log(W_k)\right]` by creating :math:`B` random
datasets. Let :math:`sd_k` be the standard deviation of :math:`\log(W_k)` over
these :math:`B` datasets. We select the smallest :math:`k` such that the gap
increase from :math:`k` to :math:`k + 1` is too small:

.. math:: k^* = \mbox{smallest } k \mbox{ such that} \;
    Gap(k) \geq Gap(k+1) - sd_{k+1}\sqrt{1 + 1/B}
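
The quantities above can be sketched with :class:`~sklearn.cluster.KMeans` and
a uniform reference distribution; this is only an illustration of the formula,
not the implementation behind ``gap_statistic``, and the helper names are
purely illustrative::

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    def log_inertia(data, k, seed=0):
        # log of the within-cluster sum of squares of ``data`` in k clusters
        return np.log(KMeans(n_clusters=k, random_state=seed).fit(data).inertia_)

    data, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    k, B = 4, 10
    rng = np.random.RandomState(0)

    # B reference datasets drawn uniformly within the bounding box of the data
    mins, maxs = data.min(axis=0), data.max(axis=0)
    log_w = np.array([log_inertia(rng.uniform(mins, maxs, size=data.shape), k)
                      for _ in range(B)])

    gap_k = log_w.mean() - log_inertia(data, k)  # Gap(k)
    sd_k = log_w.std()                           # standard deviation of log(W_k)
    # the selection rule keeps the smallest k with
    # Gap(k) >= Gap(k+1) - sd_{k+1} * sqrt(1 + 1/B)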

Usage
~~~~~

Given a dataset and a clustering algorithm (a :class:`ClusterMixin`),
gap_statistic returns the estimated number of clusters:

>>> from sklearn.datasets import make_blobs
>>> from sklearn.cluster import KMeans
>>> from sklearn.metrics.cluster.gap_statistic import gap_statistic
>>> data, labels = make_blobs(n_samples=1000, centers=4, random_state=0)
>>> kmeans_model = KMeans()
>>> gap_statistic(data, kmeans_model, k_max=10)
4

.. topic:: References

 * `"Estimating the number of clusters in a data set via the gap statistic"
   <http://web.stanford.edu/~hastie/Papers/gap.pdf>`_
   Tibshirani, R., Walther, G., & Hastie, T., *Journal of the Royal Statistical Society: Series B (Statistical Methodology)* (2001)

.. _calinski_harabaz_index:

Calinski-Harabaz index
----------------------

Presentation
~~~~~~~~~~~~

The goal of the Calinski-Harabaz index is to maximize dispersion between
clusters and minimize dispersion within clusters. Let

- :math:`N` be the number of points in our data,
- :math:`C_q` be the set of points in cluster :math:`q`,
- :math:`c_q` be the center of cluster :math:`q`,
- :math:`c` be the center of :math:`E`,
- :math:`n_q` be the number of points in cluster :math:`q`.

The Calinski-Harabaz index for data in :math:`k` clusters, denoted
:math:`CH(k)`, is defined as:

.. math::

    CH(k) = \frac{trace(B_k)}{trace(W_k)} \times \frac{N - k}{k - 1}

with

.. math::

    W_k = \sum_{q=1}^k \sum_{x \in C_q} (x - c_q) (x - c_q)^T

    B_k = \sum_{q=1}^k n_q (c_q - c) (c_q - c)^T
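
As a rough NumPy illustration of this formula (not the implementation used
inside scikit-learn; the variable names are only illustrative), the index can
be computed directly from an array of cluster labels::

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    data, _ = make_blobs(n_samples=500, centers=3, random_state=0)
    k = 3
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(data)

    n_samples = data.shape[0]
    overall_center = data.mean(axis=0)

    trace_w, trace_b = 0., 0.
    for q in range(k):
        cluster_q = data[labels == q]
        center_q = cluster_q.mean(axis=0)
        # trace(W_k): within-cluster dispersion
        trace_w += ((cluster_q - center_q) ** 2).sum()
        # trace(B_k): between-cluster dispersion
        trace_b += len(cluster_q) * ((center_q - overall_center) ** 2).sum()

    ch_index = (trace_b / trace_w) * (n_samples - k) / (k - 1)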

Advantages
~~~~~~~~~~

- The score is higher when clusters are dense and well separated, which relates
  to a standard concept of a cluster.

- Fast computation


Drawbacks
~~~~~~~~~

- The Calinski-Harabaz index is generally higher for convex clusters than other
  concepts of clusters, such as density based clusters like those obtained
  through DBSCAN.

.. topic:: References

 * "A dendrite method for cluster analysis"
   Caliński, T., & Harabasz, J., *Communications in Statistics - Theory and Methods* (1974)

.. _distortion_jump:

Distortion jump
---------------

Presentation
~~~~~~~~~~~~

Distortion jump aims to maximize efficiency (using the smallest number of
clusters) while minimizing error by information-theoretic standards (here, the
error is the variance of the data points within their cluster). The data
:math:`E` consists of :math:`N` points in :math:`d` dimensions. The average
distortion is:

.. math:: W_k = \frac{1}{d}\sum_{q=1}^k \sum_{x \in C_q} (x-c_q)^T (x-c_q)

with :math:`C_q` the set of points in cluster :math:`q` and :math:`c_q` the
center of cluster :math:`q`. The chosen number of clusters, :math:`k^*`, is
the one that maximizes our gain in information. Let :math:`y = d / 2`:

.. math:: k^* = \arg\max_{k=2\dots k_{max}} W_k^{-y} - W_{k-1}^{-y}

The choice of the transform power :math:`y = d/2` is motivated by asymptotic
reasoning using results from rate distortion theory.
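
A minimal sketch of this selection rule, assuming
:class:`~sklearn.cluster.KMeans` and using its inertia divided by the
dimension as the average distortion; the variable names are only
illustrative::

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    data, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    n_samples, d = data.shape
    y, k_max = d / 2., 10

    # average distortion W_k for k = 1 ... k_max (KMeans inertia divided by d)
    distortions = np.array([
        KMeans(n_clusters=k, random_state=0).fit(data).inertia_ / d
        for k in range(1, k_max + 1)])

    # jumps in the transformed distortion between consecutive numbers of clusters
    jumps = distortions[1:] ** (-y) - distortions[:-1] ** (-y)

    # chosen number of clusters: the largest jump (jumps[0] corresponds to k=2)
    k_star = int(np.argmax(jumps)) + 2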

.. topic:: References

 * `"Distortion Jump"
   <https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set>`_

Advantages
~~~~~~~~~~

- Fast computation

Drawbacks
~~~~~~~~~

- The distortion jump works better for convex clusters than other
  concepts of clusters, such as density based clusters like those obtained
  through DBSCAN.
