From 6e9a3bbada02ad211df387782b8f4783facfc408 Mon Sep 17 00:00:00 2001 From: jnothman Date: Sat, 18 May 2013 22:19:44 +1000 Subject: [PATCH 1/5] DOC rewrite descriptions of P/R/F averages and define support --- doc/modules/model_evaluation.rst | 113 +++++++++++------------------ sklearn/metrics/metrics.py | 121 ++++++++++++++++--------------- 2 files changed, 102 insertions(+), 132 deletions(-) diff --git a/doc/modules/model_evaluation.rst b/doc/modules/model_evaluation.rst index 843fe96e7902e..4c3f68662b9ae 100644 --- a/doc/modules/model_evaluation.rst +++ b/doc/modules/model_evaluation.rst @@ -454,97 +454,66 @@ Multiclass and multilabel classification ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In multiclass and multilabel classification task, the notions of precision, recall and F-measures can be applied to each label independently. - -Moreover, these notions can be further extended. The functions -:func:`f1_score`, :func:`fbeta_score`, :func:`precision_recall_fscore_support`, -:func:`precision_score` and :func:`recall_score` support an argument called -``average`` which defines the type of averaging: - - * ``"macro"``: average over classes (does not take imbalance - into account). - * ``"micro"``: aggregate classes and average over instances - (takes imbalance into account). This implies that - ``precision == recall == F1``. - In multilabel classification, this is true only if every sample has a label. - * ``'samples'``: average over instances. Only available and - meaningful with multilabel data. - * ``"weighted"``: average over classes weighted by support (takes imbalance - into account). Can result in F-score that is not between - precision and recall. - * ``None``: no averaging is performed. - -Let's define some notations: - - * :math:`n_\text{labels}` and :math:`n_\text{samples}` denotes respectively - the number of labels and the number of samples. - * :math:`\texttt{precision}_j`, :math:`\texttt{recall}_j` and - :math:`{F_\beta}_j` are respectively the precision, the recall and - :math:`F_\beta` measure for the :math:`j`-th label; - * :math:`tp_j`, :math:`fp_j` and :math:`fn_j` respectively the number of - true positives, false positives and false negatives for the :math:`j`-th - label; - * :math:`w_j = \frac{tp_j + fn_j}{\sum_{k=0}^{n_\text{labels} - 1} tp_k + f - n_k}` is the weighted support associated to the :math:`j`-th label; - * :math:`y_i` is the set of true label and - :math:`\hat{y}_i` is the set of predicted for the - :math:`i`-th sample; - -The macro precision, recall and :math:`F_\beta` is defined as - -.. math:: - - \texttt{macro\_{}precision} = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels} - 1} \texttt{precision}_j, - -.. math:: - - \texttt{macro\_{}recall} = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels} - 1} \texttt{recall}_j, - -.. math:: - - \texttt{macro\_{}F\_{}beta} = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels} - 1} {F_\beta}_j. - -The micro precision, recall and :math:`F_\beta` is defined as +For these purposes, it is clearer to redefine our metrics in terms of sets. +For a true set :math:`\hat{y}` and predicted set :math:`y`, we may redefine: .. math:: - \texttt{micro\_{}precision} = \frac{\sum_{j=0}^{n_\text{labels} - 1} tp_j}{\sum_{j=0}^{n_\text{labels} - 1} tp_j + \sum_{j=0}^{n_\text{labels} - 1} fp_j}, + \text{precision} = \frac{\left| y \cap \hat{y} \right|}{\left|y\right|}, .. 
math:: - \texttt{micro\_{}recall} = \frac{\sum_{j=0}^{n_\text{labels} - 1} tp_j}{\sum_{j=0}^{n_\text{labels} - 1} tp_j + \sum_{j=0}^{n_\text{labels} - 1} fn_j}, + \text{recall} = \frac{\left| y \cap \hat{y} \right|}{\left|\hat{y}\right|}. -.. math:: - - \texttt{micro\_{}F\_{}beta} = (1 + \beta^2) \frac{\texttt{micro\_{}precision} \times \texttt{micro\_{}recall}}{\beta^2 \texttt{micro\_{}precision} + \texttt{micro\_{}recall}}. - -The weighted precision, recall and :math:`F_\beta` is defined as - -.. math:: - - \texttt{weighted\_{}precision} = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels} - 1} w_j \texttt{precision}_j, - -.. math:: - - \texttt{weighted\_{}recall} = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels} - 1} w_j \texttt{recall}_j, - -.. math:: +(Where either of these denominators is zero, the metric is 1 by convention.) - \texttt{weighted\_{}F\_{}beta} = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels} - 1} w_j {F_\beta}_j. +Defining :math:`y_l` and :math:`\hat{y}_l` to be sets of samples with label +:math:`l`, metrics for each label can easily be calculated as in the binary +classification case. There are a few ways to combine results across labels, +specified by the ``average`` argument to the :func:`f1_score`, +:func:`fbeta_score`, :func:`precision_recall_fscore_support`, +:func:`precision_score` and :func:`recall_score` functions: + * ``"micro"``: calculate metrics globally where :math:`y` and + :math:`\hat{y}` are sets of :math:`(sample, label)` pairs. In the + single-label case, this implies that precision, recall and :math:`F` + are equal. + * ``"macro"``: calculate metrics for each label, and find their unweighted + mean. This does not take label imbalance into account. + * ``"weighted"``: calculate metrics for each label, and find their average + weighted by :math:`support_l = \left|\hat{y}_l\right|`. + * ``"samples"``: calculate metrics for each sample, comparing sets of + labels assigned to each, and find the mean across all samples. + This is only meaningful and available in the multilabel case. + * ``None``: calculate metrics for each label and do not average them. -The sample-based precision, recall and :math:`F_\beta` is defined as +More explicitly, we may notate: .. math:: - \texttt{example\_{}precision}(y,\hat{y}) &= \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} \frac{|y_i \cap \hat{y}_i|}{|y_i|}, + \text{avg\_{}precision} = \frac{1}{\sum_{w_j}} \sum_{j} w_j \frac{\left| y_j \cap \hat{y}_j \right|}{\left|y_j\right|}, .. math:: - \texttt{example\_{}recall}(y,\hat{y}) &= \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} \frac{|y_i \cap \hat{y}_i|}{|\hat{y}_i|}, + \text{avg\_{}recall} = \frac{1}{\sum_{w_j}} \sum_{j} w_j \frac{\left| y_j \cap \hat{y}_j \right|}{\left|y_j\right|}, .. math:: - \texttt{example\_{}F\_{}beta}(y,\hat{y}) &= \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (1 + \beta^2)\frac{|y_i \cap \hat{y}_i|}{\beta^2 |\hat{y}_i| + |y_i|}. 
+    \text{avg\_{}F}_\beta = \frac{1}{\sum_{w_j}} \sum_{j} w_j (1 + \beta^2) \frac{\texttt{avg\_{}precision} \times \texttt{avg\_{}recall}}{\beta^2 \texttt{avg\_{}precision} + \texttt{avg\_{}recall}}
+
+with:
+
++---------------+-------------------------+---------------------------------+------------------------------------+
+|``average``    | :math:`j` iterates over | :math:`w_j` is                  | :math:`y_j` consists of            |
++===============+=========================+=================================+====================================+
+|``"micro"``    | [1]                     | 1                               | all (sample, label) pairs          |
++---------------+-------------------------+---------------------------------+------------------------------------+
+|``"macro"``    | labels                  | 1                               | samples assigned label :math:`j`   |
++---------------+-------------------------+---------------------------------+------------------------------------+
+|``"weighted"`` | labels                  | # true for label :math:`j`      | samples assigned label :math:`j`   |
++---------------+-------------------------+---------------------------------+------------------------------------+
+|``"samples"``  | samples                 | # labels for instance :math:`j` | labels assigned to sample :math:`j`|
++---------------+-------------------------+---------------------------------+------------------------------------+
 
 Here is an example where ``average`` is set to ``macro``::
 
diff --git a/sklearn/metrics/metrics.py b/sklearn/metrics/metrics.py
index 5eef2868b12f5..4f6d045b9224f 100644
--- a/sklearn/metrics/metrics.py
+++ b/sklearn/metrics/metrics.py
@@ -1114,20 +1114,20 @@ def f1_score(y_true, y_pred, labels=None, pos_label=1, average='weighted'):
         unless ``pos_label`` is given in binary classification, this
         determines the type of averaging performed on the data:
 
-        ``'samples'``:
-            Average over instance. Only meaningful and available in multilabel
-            classification.
-        ``'macro'``:
-            Average over classes (does not take imbalance into account).
         ``'micro'``:
-            Aggregate classes and average over instances (takes imbalance into
-            account). This implies that ``precision == recall == F1``.
-            In multilabel classification, this is true only if every sample
-            has a label.
+            Calculate metrics globally by counting the total true positives,
+            false negatives and false positives.
+        ``'macro'``:
+            Calculate metrics for each label, and find their unweighted
+            mean. This does not take label imbalance into account.
         ``'weighted'``:
-            Average over classes weighted by support (takes imbalance into
-            account). Can result in F-score that is not between
-            precision and recall.
+            Calculate metrics for each label, and find their average, weighted
+            by support (the number of true instances for each label). This can
+            result in F-score that is not between precision and recall.
+        ``'samples'``:
+            Calculate metrics for each instance, and find their average (only
+            meaningful for multilabel classification where this differs from
+            :func:`accuracy_score`).
 
 
     Returns
@@ -1239,20 +1239,20 @@ def fbeta_score(y_true, y_pred, beta, labels=None, pos_label=1,
         unless ``pos_label`` is given in binary classification, this
         determines the type of averaging performed on the data:
 
-        ``'samples'``:
-            Average over instance. Only meaningful and available in multilabel
-            classification.
-        ``'macro'``:
-            Average over classes (does not take imbalance into account).
         ``'micro'``:
-            Aggregate classes and average over instances (takes imbalance into
-            account). This implies that ``precision == recall == F1``.
- In multilabel classification, this is true only if every sample - has a label. + Calculate metrics globally by counting the total true positives, + false negatives and false positives. + ``'macro'``: + Calculate metrics for each label, and find their unweighted + mean. This does not take label imbalance into account. ``'weighted'``: - Average over classes weighted by support (takes imbalance into - account). Can result in F-score that is not between - precision and recall. + Calculate metrics for each label, and find their average, weighted + by support (the number of true instances for each label). This can + result in F-score that is not between precision and recall. + ``'samples'``: + Calculate metrics for each instance, and find their average (only + meaningful for multilabel classification where this differs from + :func:`accuracy_score`). Returns ------- @@ -1524,20 +1524,20 @@ def precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels=None, unless ``pos_label`` is given in binary classification, this determines the type of averaging performed on the data: - ``'macro'``: - Average over classes (does not take imbalance into account). ``'micro'``: - Aggregate classes and average over instances (takes imbalance into - account). This implies that ``precision == recall == F1``. - In multilabel classification, this is true only if every sample - has a label. + Calculate metrics globally by counting the total true positives, + false negatives and false positives. + ``'macro'``: + Calculate metrics for each label, and find their unweighted + mean. This does not take label imbalance into account. + ``'weighted'``: + Calculate metrics for each label, and find their average, weighted + by support (the number of true instances for each label). This can + result in F-score that is not between precision and recall. ``'samples'``: - Average over instance. Only meaningful and available in multilabel - classification. - ``'weighted'``: - Average over classes weighted by support (takes imbalance into - account). Can result in F-score that is not between - precision and recall. + Calculate metrics for each instance, and find their average (only + meaningful for multilabel classification where this differs from + :func:`accuracy_score`). Returns @@ -1553,6 +1553,7 @@ def precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels=None, support: int (if average is not None) or array of int, shape =\ [n_unique_labels] + The number of occurrences of each label in ``y_true``. References ---------- @@ -1816,20 +1817,20 @@ def precision_score(y_true, y_pred, labels=None, pos_label=1, unless ``pos_label`` is given in binary classification, this determines the type of averaging performed on the data: - ``'macro'``: - Average over classes (does not take imbalance into account). ``'micro'``: - Aggregate classes and average over instances (takes imbalance into - account). This implies that ``precision == recall == F1``. - In multilabel classification, this is true only if every sample - has a label. - ``'samples'``: - Average over instance. Only meaningful and available in multilabel - classification. + Calculate metrics globally by counting the total true positives, + false negatives and false positives. + ``'macro'``: + Calculate metrics for each label, and find their unweighted + mean. This does not take label imbalance into account. ``'weighted'``: - Average over classes weighted by support (takes imbalance into - account). Can result in F-score that is not between - precision and recall. 
+ Calculate metrics for each label, and find their average, weighted + by support (the number of true instances for each label). This can + result in F-score that is not between precision and recall. + ``'samples'``: + Calculate metrics for each instance, and find their average (only + meaningful for multilabel classification where this differs from + :func:`accuracy_score`). Returns ------- @@ -1939,20 +1940,20 @@ def recall_score(y_true, y_pred, labels=None, pos_label=1, average='weighted'): unless ``pos_label`` is given in binary classification, this determines the type of averaging performed on the data: - ``'macro'``: - Average over classes (does not take imbalance into account). ``'micro'``: - Aggregate classes and average over instances (takes imbalance into - account). This implies that ``precision == recall == F1``. - In multilabel classification, this is true only if every sample - has a label. - ``'samples'``: - Average over instance. Only meaningful and available in multilabel - classification. + Calculate metrics globally by counting the total true positives, + false negatives and false positives. + ``'macro'``: + Calculate metrics for each label, and find their unweighted + mean. This does not take label imbalance into account. ``'weighted'``: - Average over classes weighted by support (takes imbalance into - account). Can result in F-score that is not between - precision and recall. + Calculate metrics for each label, and find their average, weighted + by support (the number of true instances for each label). This can + result in F-score that is not between precision and recall. + ``'samples'``: + Calculate metrics for each instance, and find their average (only + meaningful for multilabel classification where this differs from + :func:`accuracy_score`). Returns ------- From c2c8dd8b3882e6b85dfa71047fa3ecc7ccb82664 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Tue, 21 May 2013 10:20:00 +1000 Subject: [PATCH 2/5] DOC cleaner descriptions of PRF averages Thanks to arjoly --- doc/modules/model_evaluation.rst | 100 ++++++++++++++----------------- 1 file changed, 45 insertions(+), 55 deletions(-) diff --git a/doc/modules/model_evaluation.rst b/doc/modules/model_evaluation.rst index 4c3f68662b9ae..5a0d4cb18168e 100644 --- a/doc/modules/model_evaluation.rst +++ b/doc/modules/model_evaluation.rst @@ -454,66 +454,56 @@ Multiclass and multilabel classification ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In multiclass and multilabel classification task, the notions of precision, recall and F-measures can be applied to each label independently. -For these purposes, it is clearer to redefine our metrics in terms of sets. -For a true set :math:`\hat{y}` and predicted set :math:`y`, we may redefine: - -.. math:: - - \text{precision} = \frac{\left| y \cap \hat{y} \right|}{\left|y\right|}, - -.. math:: - - \text{recall} = \frac{\left| y \cap \hat{y} \right|}{\left|\hat{y}\right|}. - -(Where either of these denominators is zero, the metric is 1 by convention.) - -Defining :math:`y_l` and :math:`\hat{y}_l` to be sets of samples with label -:math:`l`, metrics for each label can easily be calculated as in the binary -classification case. 
There are a few ways to combine results across labels, +There are a few ways to combine results across labels, specified by the ``average`` argument to the :func:`f1_score`, :func:`fbeta_score`, :func:`precision_recall_fscore_support`, :func:`precision_score` and :func:`recall_score` functions: - * ``"micro"``: calculate metrics globally where :math:`y` and - :math:`\hat{y}` are sets of :math:`(sample, label)` pairs. In the - single-label case, this implies that precision, recall and :math:`F` - are equal. - * ``"macro"``: calculate metrics for each label, and find their unweighted - mean. This does not take label imbalance into account. - * ``"weighted"``: calculate metrics for each label, and find their average - weighted by :math:`support_l = \left|\hat{y}_l\right|`. - * ``"samples"``: calculate metrics for each sample, comparing sets of - labels assigned to each, and find the mean across all samples. - This is only meaningful and available in the multilabel case. - * ``None``: calculate metrics for each label and do not average them. - -More explicitly, we may notate: - -.. math:: - - \text{avg\_{}precision} = \frac{1}{\sum_{w_j}} \sum_{j} w_j \frac{\left| y_j \cap \hat{y}_j \right|}{\left|y_j\right|}, - -.. math:: - - \text{avg\_{}recall} = \frac{1}{\sum_{w_j}} \sum_{j} w_j \frac{\left| y_j \cap \hat{y}_j \right|}{\left|y_j\right|}, - -.. math:: +* ``"micro"``: calculate metrics globally by counting the total true + positives, false negatives and false positives. In the single-label case, + this implies that precision, recall and :math:`F` are equal. +* ``"samples"``: calculate metrics for each sample, comparing sets of + labels assigned to each, and find the mean across all samples. + This is only meaningful and available in the multilabel case. +* ``"macro"``: calculate metrics for each label, and find their mean. + This does not take label imbalance into account. +* ``"weighted"``: calculate metrics for each label, and find their average + weighted by the number of occurrences of the label in the true data. +* ``None``: calculate metrics for each label and do not average them. + +To make this more explicit, consider the following notation: + +* :math:`y` the set of *predicted* :math:`(sample, label)` pairs +* :math:`\hat{y}` the set of *true* :math:`(sample, label)` pairs +* :math:`L` the set of labels +* :math:`S` the set of samples +* :math:`y_s` the subset of :math:`y` with sample :math:`s`, + i.e. :math:`y_s := \left\{(s', l) \in y | s' = s\right\}` +* :math:`y_l` the subset of :math:`y` with label :math:`l` +* similarly, :math:`\hat{y}_s` and :math:`\hat{y}_l` are subsets of + :math:`\hat{y}` +* :math:`P(A, B) := \frac{\left| A \cap B \right|}{\left|A\right|}` + (Where :math:`A = \emptyset`, :math:`P(A, B):=1`.) +* :math:`R(A, B) := \frac{\left| A \cap B \right|}{\left|B\right|}` + (Where :math:`B = \emptyset`, :math:`R(A, B):=1`.) 
+* :math:`F_\beta(A, B) := \left(1 + \beta^2\right) \frac{P(A, B) \times R(A, B)}{\beta^2 P(A, B) + R(A, B)}` + +Then the metrics are defined as: + ++---------------+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+ +|``average`` | Precision | Recall | F\_{}beta | ++===============+==================================================================================================================+==================================================================================================================+======================================================================================================================+ +|``"micro"`` | :math:`P(y, \hat{y})` | :math:`R(y, \hat{y})` | :math:`F_\beta(y, \hat{y})` | ++---------------+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+ +|``"samples"`` | :math:`\frac{1}{\left|S\right|} \sum_{s \in S} P(y_s, \hat{y}_s)` | :math:`\frac{1}{\left|S\right|} \sum_{s \in S} R(y_s, \hat{y}_s)` | :math:`\frac{1}{\left|S\right|} \sum_{s \in S} F_\beta(y_s, \hat{y}_s)` | ++---------------+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+ +|``"macro"`` | :math:`\frac{1}{\left|L\right|} \sum_{l \in L} P(y_l, \hat{y}_l)` | :math:`\frac{1}{\left|L\right|} \sum_{l \in L} R(y_l, \hat{y}_l)` | :math:`\frac{1}{\left|L\right|} \sum_{l \in L} F_\beta(y_l, \hat{y}_l)` | ++---------------+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+ +|``"weighted"`` | :math:`\frac{1}{\sum_{l \in L} \left|\hat{y}_l\right|} \sum_{l \in L} \left|\hat{y}_l\right| P(y_l, \hat{y}_l)` | :math:`\frac{1}{\sum_{l \in L} \left|\hat{y}_l\right|} \sum_{l \in L} \left|\hat{y}_l\right| R(y_l, \hat{y}_l)` | :math:`\frac{1}{\sum_{l \in L} \left|\hat{y}_l\right|} \sum_{l \in L} \left|\hat{y}_l\right| F_\beta(y_l, \hat{y}_l)`| ++---------------+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+ +|``None`` | :math:`\langle P(y_l, \hat{y}_l) | l \in L \rangle` | :math:`\langle R(y_l, \hat{y}_l) | l \in L \rangle` | :math:`\langle F_\beta(y_l, \hat{y}_l) | l \in L \rangle` | 
++---------------+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
-    \text{avg\_{}F}_\beta = \frac{1}{\sum_{w_j}} \sum_{j} w_j (1 + \beta^2) \frac{\texttt{avg\_{}precision} \times \texttt{avg\_{}recall}}{\beta^2 \texttt{avg\_{}precision} + \texttt{avg\_{}recall}}
-
-with:
-
-+---------------+-------------------------+---------------------------------+------------------------------------+
-|``average``    | :math:`j` iterates over | :math:`w_j` is                  | :math:`y_j` consists of            |
-+===============+=========================+=================================+====================================+
-|``"micro"``    | [1]                     | 1                               | all (sample, label) pairs          |
-+---------------+-------------------------+---------------------------------+------------------------------------+
-|``"macro"``    | labels                  | 1                               | samples assigned label :math:`j`   |
-+---------------+-------------------------+---------------------------------+------------------------------------+
-|``"weighted"`` | labels                  | # true for label :math:`j`      | samples assigned label :math:`j`   |
-+---------------+-------------------------+---------------------------------+------------------------------------+
-|``"samples"``  | samples                 | # labels for instance :math:`j` | labels assigned to sample :math:`j`|
-+---------------+-------------------------+---------------------------------+------------------------------------+
 
 Here is an example where ``average`` is set to ``macro``::
 
From 799060f7142604c426a0985be1b440a85b9280d4 Mon Sep 17 00:00:00 2001
From: Joel Nothman
Date: Tue, 21 May 2013 10:22:06 +1000
Subject: [PATCH 3/5] DOC avoid 'single-label case'

---
 doc/modules/model_evaluation.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/modules/model_evaluation.rst b/doc/modules/model_evaluation.rst
index 5a0d4cb18168e..87b82e4e81fef 100644
--- a/doc/modules/model_evaluation.rst
+++ b/doc/modules/model_evaluation.rst
@@ -460,8 +460,8 @@ specified by the ``average`` argument to the :func:`f1_score`,
 :func:`precision_score` and :func:`recall_score` functions:
 
 * ``"micro"``: calculate metrics globally by counting the total true
-  positives, false negatives and false positives. In the single-label case,
-  this implies that precision, recall and :math:`F` are equal.
+  positives, false negatives and false positives. Except in the multilabel
+  case, this implies that precision, recall and :math:`F` are equal.
 * ``"samples"``: calculate metrics for each sample, comparing sets of
   labels assigned to each, and find the mean across all samples.
   This is only meaningful and available in the multilabel case.
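As a quick illustration of the averaging options documented in the patches above, here is a minimal sketch (not taken from the patches; it relies only on the ``average`` keyword of the functions under discussion, and the toy labels below are made up)::

    # Toy multiclass example comparing the ``average`` options.
    from sklearn.metrics import f1_score, precision_recall_fscore_support

    y_true = [0, 1, 2, 0, 1, 2]
    y_pred = [0, 2, 1, 0, 0, 1]

    # average=None returns one score per label, plus the support column
    # documented above (the number of occurrences of each label in y_true).
    p, r, f, support = precision_recall_fscore_support(y_true, y_pred,
                                                       average=None)

    # 'macro' is the unweighted mean over labels, 'weighted' weights each
    # label by its support, and 'micro' pools the true/false positives and
    # false negatives of all labels before computing a single score.
    for avg in ('macro', 'weighted', 'micro'):
        print("%s: %0.2f" % (avg, f1_score(y_true, y_pred, average=avg)))

The exact numbers printed depend on the scikit-learn version, but the per-label arrays returned with ``average=None`` let you verify the averages by hand: ``'macro'`` is ``f.mean()`` and ``'weighted'`` is ``numpy.average(f, weights=support)``.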
From d15365ff225b3b4604aa7e8abbacd946466a4872 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Tue, 21 May 2013 10:25:00 +1000 Subject: [PATCH 4/5] Remove spurious {} --- doc/modules/model_evaluation.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/modules/model_evaluation.rst b/doc/modules/model_evaluation.rst index 87b82e4e81fef..a675b2565e072 100644 --- a/doc/modules/model_evaluation.rst +++ b/doc/modules/model_evaluation.rst @@ -491,7 +491,7 @@ To make this more explicit, consider the following notation: Then the metrics are defined as: +---------------+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+ -|``average`` | Precision | Recall | F\_{}beta | +|``average`` | Precision | Recall | F\_beta | +===============+==================================================================================================================+==================================================================================================================+======================================================================================================================+ |``"micro"`` | :math:`P(y, \hat{y})` | :math:`R(y, \hat{y})` | :math:`F_\beta(y, \hat{y})` | +---------------+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+ From 71590ef20a03d4ea37bcb1c91f7d558234374435 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Tue, 21 May 2013 21:50:57 +1000 Subject: [PATCH 5/5] Note on 'weighted' average and imbalance --- doc/modules/model_evaluation.rst | 2 ++ sklearn/metrics/metrics.py | 27 ++++++++++++++++----------- 2 files changed, 18 insertions(+), 11 deletions(-) diff --git a/doc/modules/model_evaluation.rst b/doc/modules/model_evaluation.rst index a675b2565e072..40a55970a7608 100644 --- a/doc/modules/model_evaluation.rst +++ b/doc/modules/model_evaluation.rst @@ -469,6 +469,8 @@ specified by the ``average`` argument to the :func:`f1_score`, This does not take label imbalance into account. * ``"weighted"``: calculate metrics for each label, and find their average weighted by the number of occurrences of the label in the true data. + This alters ``"macro"`` to account for label imbalance; it may produce an + F-score that is not between precision and recall. * ``None``: calculate metrics for each label and do not average them. To make this more explicit, consider the following notation: diff --git a/sklearn/metrics/metrics.py b/sklearn/metrics/metrics.py index 4f6d045b9224f..bc9c9f8c78d7c 100644 --- a/sklearn/metrics/metrics.py +++ b/sklearn/metrics/metrics.py @@ -1122,8 +1122,9 @@ def f1_score(y_true, y_pred, labels=None, pos_label=1, average='weighted'): mean. This does not take label imbalance into account. ``'weighted'``: Calculate metrics for each label, and find their average, weighted - by support (the number of true instances for each label). This can - result in F-score that is not between precision and recall. + by support (the number of true instances for each label). 
This + alters 'macro' to account for label imbalance; it can result in an + F-score that is not between precision and recall. ``'samples'``: Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from @@ -1247,8 +1248,9 @@ def fbeta_score(y_true, y_pred, beta, labels=None, pos_label=1, mean. This does not take label imbalance into account. ``'weighted'``: Calculate metrics for each label, and find their average, weighted - by support (the number of true instances for each label). This can - result in F-score that is not between precision and recall. + by support (the number of true instances for each label). This + alters 'macro' to account for label imbalance; it can result in an + F-score that is not between precision and recall. ``'samples'``: Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from @@ -1497,7 +1499,7 @@ def precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels=None, If ``pos_label is None`` and in binary classification, this function returns the average precision, recall and F-measure if ``average`` - is one of ``'micro'``, ``'macro'``, ``'weighted'``. + is one of ``'micro'``, ``'macro'``, ``'weighted'`` or ``'samples'``. Parameters ---------- @@ -1532,8 +1534,9 @@ def precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels=None, mean. This does not take label imbalance into account. ``'weighted'``: Calculate metrics for each label, and find their average, weighted - by support (the number of true instances for each label). This can - result in F-score that is not between precision and recall. + by support (the number of true instances for each label). This + alters 'macro' to account for label imbalance; it can result in an + F-score that is not between precision and recall. ``'samples'``: Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from @@ -1825,8 +1828,9 @@ def precision_score(y_true, y_pred, labels=None, pos_label=1, mean. This does not take label imbalance into account. ``'weighted'``: Calculate metrics for each label, and find their average, weighted - by support (the number of true instances for each label). This can - result in F-score that is not between precision and recall. + by support (the number of true instances for each label). This + alters 'macro' to account for label imbalance; it can result in an + F-score that is not between precision and recall. ``'samples'``: Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from @@ -1948,8 +1952,9 @@ def recall_score(y_true, y_pred, labels=None, pos_label=1, average='weighted'): mean. This does not take label imbalance into account. ``'weighted'``: Calculate metrics for each label, and find their average, weighted - by support (the number of true instances for each label). This can - result in F-score that is not between precision and recall. + by support (the number of true instances for each label). This + alters 'macro' to account for label imbalance; it can result in an + F-score that is not between precision and recall. ``'samples'``: Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from