Docstring fix. · scikit-learn/scikit-learn@6f1d7e8 · GitHub

Commit 6f1d7e8

Docstring fix.
1 parent d66d7db commit 6f1d7e8

5 files changed

+52
-50
lines changed

doc/modules/classes.rst

Lines changed: 1 addition & 0 deletions
@@ -868,6 +868,7 @@ details.
 
 metrics.adjusted_mutual_info_score
 metrics.adjusted_rand_score
+metrics.calinski_harabaz_score
 metrics.completeness_score
 metrics.fowlkes_mallows_score
 metrics.homogeneity_completeness_v_measure

doc/modules/clustering.rst

Lines changed: 29 additions & 24 deletions
@@ -1344,18 +1344,20 @@ mean of homogeneity and completeness**:
 Fowlkes-Mallows scores
 ----------------------
 
-The Fowlkes-Mallows index (FMI) is defined as the geometric mean of
-the pairwise precision and recall::
+The Fowlkes-Mallows index (:func:`sklearn.metrics.fowlkes_mallows_score`) can be
+used when the ground truth class assignments of the samples is known. The
+Fowlkes-Mallows score FMI is defined as the geometric mean of the
+pairwise precision and recall:
 
-FMI = TP / sqrt((TP + FP) * (TP + FN))
+.. math:: \text{FMI} = \frac{\text{TP}}{\sqrt{(\text{TP} + \text{FP}) (\text{TP} + \text{FN})}}
 
-Where :math:`TP` is the number of **True Positive** (i.e. the number of pair
-of points that belong to the same clusters in both labels_true and
-labels_pred), :math:`FP` is the number of **False Positive** (i.e. the number
-of pair of points that belong to the same clusters in labels_true and not
-in labels_pred) and :math:`FN`is the number of **False Negative** (i.e the
-number of pair of points that belongs in the same clusters in labels_pred
-and not in labels_True).
+Where ``TP`` is the number of **True Positive** (i.e. the number of pair
+of points that belong to the same clusters in both the true labels and the
+predicted labels), ``FP`` is the number of **False Positive** (i.e. the number
+of pair of points that belong to the same clusters in the true labels and not
+in the predicted labels) and ``FN`` is the number of **False Negative** (i.e the
+number of pair of points that belongs in the same clusters in the predicted
+labels and not in the true labels).
 
 The score ranges from 0 to 1. A high value indicates a good similarity
 between two clusters.
@@ -1505,24 +1507,28 @@ Calinski-Harabaz Index
 ----------------------
 
 If the ground truth labels are not known, the Calinski-Harabaz index
-(:func:'sklearn.metrics.calinski_harabaz_score') can be used to evaluate the
+(:func:`sklearn.metrics.calinski_harabaz_score`) can be used to evaluate the
 model, where a higher Calinski-Harabaz score relates to a model with better
 defined clusters.
 
-For :math:`k` clusters, the Calinski-Harabaz :math:`ch` is given as the ratio
-of the between-clusters dispersion mean and the within-cluster dispersion:
+For :math:`k` clusters, the Calinski-Harabaz score :math:`s` is given as the
+ratio of the between-clusters dispersion mean and the within-cluster
+dispersion:
 
 .. math::
-ch(k) = \frac{trace(B_k)}{trace(W_k)} \times \frac{N - k}{k - 1}
-W_k = \sum_{q=1}^k \sum_{x \in C_q} (x - c_q) (x - c_q)^T \\
-B_k = \sum_q n_q (c_q - c) (c_q - c)^T \\
+s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{N - k}{k - 1}
 
-where:
-- :math:`N` be the number of points in our data,
-- :math:`C_q` be the set of points in cluster :math:`q`,
-- :math:`c_q` be the center of cluster :math:`q`,
-- :math:`c` be the center of :math:`E`,
-- :math:`n_q` be the number of points in cluster :math:`q`:
+where :math:`B_K` is the between group dispersion matrix and :math:`W_K`
+is the within-cluster dispersion matrix defined by:
+
+.. math:: W_k = \sum_{q=1}^k \sum_{x \in C_q} (x - c_q) (x - c_q)^T
+
+.. math:: B_k = \sum_q n_q (c_q - c) (c_q - c)^T
+
+with :math:`N` be the number of points in our data, :math:`C_q` be the set of
+points in cluster :math:`q`, :math:`c_q` be the center of cluster
+:math:`q`, :math:`c` be the center of :math:`E`, :math:`n_q` be the number of
+points in cluster :math:`q`.
 
 
 >>> from sklearn import metrics
@@ -1539,8 +1545,7 @@ cluster analysis.
 >>> from sklearn.cluster import KMeans
 >>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
 >>> labels = kmeans_model.labels_
->>> metrics.calinski_harabaz_score(X, labels)
-... # doctest: +ELLIPSIS
+>>> metrics.calinski_harabaz_score(X, labels) # doctest: +ELLIPSIS
 560.39...
 
 
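The Calinski-Harabaz formula documented in the hunks above can be checked by hand with a minimal pure-Python sketch. It relies on the identity that the trace of an outer product (x - c)(x - c)^T equals the squared Euclidean distance between x and c, so no dispersion matrix is actually built. The function name `calinski_harabaz_by_hand` and the toy data are illustrative assumptions; they are not part of this commit or of scikit-learn, and the sketch does not handle a single cluster or zero within-cluster dispersion.

```python
def calinski_harabaz_by_hand(X, labels):
    """Evaluate s(k) = Tr(B_k) / Tr(W_k) * (N - k) / (k - 1) on plain lists.

    Hypothetical helper for illustration only; not a scikit-learn function.
    """
    n_samples = len(X)
    n_dims = len(X[0])

    # Group the points by cluster label: C_q in the formula.
    clusters = {}
    for point, label in zip(X, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)

    def mean(points):
        return [sum(p[d] for p in points) / len(points) for d in range(n_dims)]

    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    c = mean(X)  # center of the whole data set
    trace_w = 0.0  # Tr(W_k): within-cluster dispersion
    trace_b = 0.0  # Tr(B_k): between-cluster dispersion
    for points in clusters.values():
        c_q = mean(points)  # center of cluster q
        trace_w += sum(squared_distance(x, c_q) for x in points)
        trace_b += len(points) * squared_distance(c_q, c)

    return (trace_b / trace_w) * (n_samples - k) / (k - 1)


# Two tight, well separated clusters give a large score.
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
print(calinski_harabaz_by_hand(X, labels))  # Tr(B)=100, Tr(W)=4 -> 50.0
```

For the toy data, each cluster center is one unit away from its two points (Tr(W_k) = 4) and five units away from the global center (Tr(B_k) = 2 * 25 + 2 * 25 = 100), so the score is 100 / 4 * (4 - 2) / (2 - 1) = 50.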

doc/sphinxext/gen_rst.py

Lines changed: 2 additions & 3 deletions
@@ -26,12 +26,11 @@
 # Try Python 2 first, otherwise load from Python 3
 try:
     from StringIO import StringIO
-    from BytesIO import BytesIO
     import cPickle as pickle
     import urllib2 as urllib
     from urllib2 import HTTPError, URLError
 except ImportError:
-    from io import StringIO, BytesIO
+    from io import StringIO
     import pickle
     import urllib.request
     import urllib.error
@@ -105,7 +104,7 @@ def _get_data(url):
     if encoding == 'plain':
         pass
     elif encoding == 'gzip':
-        data = BytesIO(data)
+        data = StringIO(data)
         data = gzip.GzipFile(fileobj=data).read()
     else:
         raise RuntimeError('unknown encoding')
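A note for readers on the `StringIO` change above: its behaviour may depend on which import branch ran. On Python 2, `StringIO.StringIO` accepts byte strings, so wrapping gzip data in it works. On Python 3, `io.StringIO` accepts only text, while `gzip.GzipFile` needs a binary file object. The sketch below, assuming Python 3 and a hypothetical helper name `gunzip_bytes`, shows the `io.BytesIO` form and the `TypeError` that `io.StringIO` raises when given bytes. It is an illustration, not the code in `gen_rst.py`.

```python
import gzip
import io


def gunzip_bytes(raw):
    """Decompress gzip-encoded bytes held in memory (hypothetical helper)."""
    # GzipFile reads from a binary file object, hence BytesIO.
    return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()


# Stand-in for a gzip-encoded HTTP response body.
payload = gzip.compress(b"hello docs")
print(gunzip_bytes(payload))

# On Python 3, StringIO refuses bytes outright.
try:
    io.StringIO(payload)
except TypeError as exc:
    print("io.StringIO rejected bytes:", exc)
```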

sklearn/metrics/cluster/supervised.py

Lines changed: 11 additions & 9 deletions
@@ -824,23 +824,25 @@ def fowlkes_mallows_score(labels_true, labels_pred, max_n_classes=5000):
 
         FMI = TP / sqrt((TP + FP) * (TP + FN))
 
-    Where :math:`TP` is the number of `True Positive` (i.e. the number of pair
-    of points that belongs in the same clusters in both labels_true and
-    labels_pred), :math:`FP` is the number of `False Positive` (i.e. the number
-    of pair of points that belongs in the same clusters in labels_true and not
-    in labels_pred) and :math:`FN`is the number of `False Negative` (i.e the
-    number of pair of points that belongs in the same clusters in labels_pred
-    and not in labels_True).
+    Where ``TP`` is the number of **True Positive** (i.e. the number of pair of
+    points that belongs in the same clusters in both ``labels_true`` and
+    ``labels_pred``), ``FP`` is the number of **False Positive** (i.e. the
+    number of pair of points that belongs in the same clusters in
+    ``labels_true`` and not in ``labels_pred``) and ``FN`` is the number of
+    **False Negative** (i.e the number of pair of points that belongs in the
+    same clusters in ``labels_pred`` and not in ``labels_True``).
 
     The score ranges from 0 to 1. A high value indicates a good similarity
     between two clusters.
 
+    Read more in the :ref:`User Guide <fowlkes_mallows_scores>`.
+
     Parameters
     ----------
-    labels_true : int array, shape = (n_samples,)
+    labels_true : int array, shape = (``n_samples``,)
         A clustering of the data into disjoint subsets.
 
-    labels_pred : array, shape = (n_samples, )
+    labels_pred : array, shape = (``n_samples``, )
         A clustering of the data into disjoint subsets.
 
     max_n_classes : int, optional (default=5000)
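The pair-counting definition in this docstring can be made concrete with a brute-force sketch that enumerates every pair of samples. It follows the docstring's own naming for ``FP`` and ``FN``; since the formula is symmetric in the two, swapping them would not change the result. The helper name `fmi_by_pair_counting` is an assumption for illustration, and the quadratic loop is only meant for small inputs. It is not how scikit-learn computes the score.

```python
from itertools import combinations
from math import sqrt


def fmi_by_pair_counting(labels_true, labels_pred):
    """FMI = TP / sqrt((TP + FP) * (TP + FN)), counted pair by pair.

    Hypothetical helper for illustration only; O(n^2) in the sample count.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # together in both labelings
        elif same_true:
            fp += 1  # together in labels_true only (docstring's FP)
        elif same_pred:
            fn += 1  # together in labels_pred only (docstring's FN)
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


# Identical partitions (up to a renaming of the labels) score 1.0.
print(fmi_by_pair_counting([0, 0, 1, 1], [1, 1, 0, 0]))

# TP = 2, FP = 4, FN = 1, so FMI = 2 / sqrt(6 * 3), roughly 0.4714.
print(fmi_by_pair_counting([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))
```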

sklearn/metrics/cluster/unsupervised.py

Lines changed: 9 additions & 14 deletions
@@ -212,26 +212,21 @@ def calinski_harabaz_score(X, labels):
     """Compute the Calinski and Harabaz score.
 
     The score is defined as ratio between the within-cluster dispersion and
-    the between-cluster dispersion. For :math:`K` cluster, the Calinski and
-    Harabaz score is defined as::
+    the between-cluster dispersion.
 
+    Read more in the :ref:`User Guide <calinski_harabaz_index>`.
 
-    With :math:`B_K` the between group dispersion matrix and :math:`W_K`
-    the within-cluster dispersion matrix defined by::
-    the number of samples in the cluster :math:`k` and the mean of the samples
-    in the cluster :math:`k` and the mean of all the samples.
-
-    Parameter
-    ---------
-    X : array-like, shape (n_samples, n_features)
-        List of n_features-dimensional data points. Each row corresponds
+    Parameters
+    ----------
+    X : array-like, shape (``n_samples``, ``n_features``)
+        List of ``n_features``-dimensional data points. Each row corresponds
         to a single data point.
 
-    labels : array-like, shape (n_samples,)
+    labels : array-like, shape (``n_samples``,)
         Predicted labels for each sample.
 
-    Return
-    ------
+    Returns
+    -------
     score: float
         The resulting Calinski-Harabaz score.
 
