DOC Add average precision definitions and cross references (#9583) · scikit-learn/scikit-learn@59bb32f · GitHub

Commit 59bb32f

agitter authored and jnothman committed
DOC Add average precision definitions and cross references (#9583)
1 parent cceb9b2 commit 59bb32f

File tree

3 files changed: +78 -22 lines changed

doc/modules/model_evaluation.rst

Lines changed: 37 additions & 4 deletions
@@ -634,10 +634,25 @@ The :func:`precision_recall_curve` computes a precision-recall curve
 from the ground truth label and a score given by the classifier
 by varying a decision threshold.
 
-The :func:`average_precision_score` function computes the average precision
-(AP) from prediction scores. This score corresponds to the area under the
-precision-recall curve. The value is between 0 and 1 and higher is better.
-With random predictions, the AP is the fraction of positive samples.
+The :func:`average_precision_score` function computes the
+`average precision <http://en.wikipedia.org/w/index.php?title=Information_retrieval&oldid=793358396#Average_precision>`_
+(AP) from prediction scores. The value is between 0 and 1 and higher is better.
+AP is defined as
+
+.. math::
+    \text{AP} = \sum_n (R_n - R_{n-1}) P_n
+
+where :math:`P_n` and :math:`R_n` are the precision and recall at the
+nth threshold. With random predictions, the AP is the fraction of positive
+samples.
+
+References [Manning2008]_ and [Everingham2010]_ present alternative variants of
+AP that interpolate the precision-recall curve. Currently,
+:func:`average_precision_score` does not implement any interpolated variant.
+References [Davis2006]_ and [Flach2015]_ describe why a linear interpolation of
+points on the precision-recall curve provides an overly-optimistic measure of
+classifier performance. This linear interpolation is used when computing area
+under the curve with the trapezoidal rule in :func:`auc`.
 
 Several functions allow you to analyze the precision, recall and F-measures
 score:
@@ -672,6 +687,24 @@ binary classification and multilabel indicator format.
     for an example of :func:`precision_recall_curve` usage to evaluate
     classifier output quality.
 
+
+.. topic:: References:
+
+  .. [Manning2008] C.D. Manning, P. Raghavan, H. Schütze, `Introduction to Information Retrieval
+     <http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html>`_,
+     2008.
+  .. [Everingham2010] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman,
+     `The Pascal Visual Object Classes (VOC) Challenge
+     <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf>`_,
+     IJCV 2010.
+  .. [Davis2006] J. Davis, M. Goadrich, `The Relationship Between Precision-Recall and ROC Curves
+     <http://www.machinelearning.org/proceedings/icml2006/030_The_Relationship_Bet.pdf>`_,
+     ICML 2006.
+  .. [Flach2015] P.A. Flach, M. Kull, `Precision-Recall-Gain Curves: PR Analysis Done Right
+     <http://papers.nips.cc/paper/5867-precision-recall-gain-curves-pr-analysis-done-right.pdf>`_,
+     NIPS 2015.
+
+
 Binary classification
 ^^^^^^^^^^^^^^^^^^^^^
 
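The non-interpolated AP added to the user guide above can be checked numerically against the trapezoidal area that :func:`auc` computes over the same operating points. A minimal sketch, assuming scikit-learn and NumPy are installed and using made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Illustrative labels and classifier scores (made up for this sketch).
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, y_scores)

# Non-interpolated AP: sum_n (R_n - R_{n-1}) * P_n.
# precision_recall_curve returns recall in decreasing order, hence the sign flip.
ap_by_hand = -np.sum(np.diff(recall) * precision[:-1])

print("AP (average_precision_score):", average_precision_score(y_true, y_scores))
print("AP (step-wise sum)          :", ap_by_hand)

# Trapezoidal area under the same points uses linear interpolation and can be
# larger (overly optimistic) than the non-interpolated AP.
print("trapezoidal area (auc)      :", auc(recall, precision))
```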

examples/model_selection/plot_precision_recall.py

Lines changed: 10 additions & 5 deletions
@@ -61,16 +61,21 @@
 in the threshold considerably reduces precision, with only a minor gain in
 recall.
 
-**Average precision** summarizes such a plot as the weighted mean of precisions
-achieved at each threshold, with the increase in recall from the previous
-threshold used as the weight:
+**Average precision** (AP) summarizes such a plot as the weighted mean of
+precisions achieved at each threshold, with the increase in recall from the
+previous threshold used as the weight:
 
 :math:`\\text{AP} = \\sum_n (R_n - R_{n-1}) P_n`
 
 where :math:`P_n` and :math:`R_n` are the precision and recall at the
 nth threshold. A pair :math:`(R_k, P_k)` is referred to as an
 *operating point*.
 
+AP and the trapezoidal area under the operating points
+(:func:`sklearn.metrics.auc`) are common ways to summarize a precision-recall
+curve that lead to different results. Read more in the
+:ref:`User Guide <precision_recall_f_measure_metrics>`.
+
 Precision-recall curves are typically used in binary classification to study
 the output of a classifier. In order to extend the precision-recall curve and
 average precision to multi-class or multi-label classification, it is necessary
@@ -144,7 +149,7 @@
 plt.ylabel('Precision')
 plt.ylim([0.0, 1.05])
 plt.xlim([0.0, 1.0])
-plt.title('2-class Precision-Recall curve: AUC={0:0.2f}'.format(
+plt.title('2-class Precision-Recall curve: AP={0:0.2f}'.format(
     average_precision))
 
 ###############################################################################
@@ -215,7 +220,7 @@
 plt.ylim([0.0, 1.05])
 plt.xlim([0.0, 1.0])
 plt.title(
-    'Average precision score, micro-averaged over all classes: AUC={0:0.2f}'
+    'Average precision score, micro-averaged over all classes: AP={0:0.2f}'
     .format(average_precision["micro"]))
 
 ###############################################################################
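The micro-averaged figure reported in the retitled plot above comes from `average_precision_score` with `average="micro"` on a label indicator matrix. A minimal sketch with hypothetical multi-label data (the example itself builds `Y_test` and `y_score` from a real dataset; these values are placeholders only):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical binary indicator matrix and per-class decision scores.
Y_test = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [0, 0, 1],
                   [1, 1, 0]])
y_score = np.array([[0.8, 0.2, 0.1],
                    [0.2, 0.6, 0.7],
                    [0.3, 0.3, 0.9],
                    [0.6, 0.5, 0.2]])

ap_micro = average_precision_score(Y_test, y_score, average="micro")
print('Average precision score, micro-averaged over all classes: '
      'AP={0:0.2f}'.format(ap_micro))
```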

sklearn/metrics/ranking.py

Lines changed: 31 additions & 13 deletions
@@ -40,7 +40,9 @@ def auc(x, y, reorder=False):
     """Compute Area Under the Curve (AUC) using the trapezoidal rule
 
     This is a general function, given points on a curve. For computing the
-    area under the ROC-curve, see :func:`roc_auc_score`.
+    area under the ROC-curve, see :func:`roc_auc_score`. For an alternative
+    way to summarize a precision-recall curve, see
+    :func:`average_precision_score`.
 
     Parameters
     ----------
@@ -68,7 +70,8 @@ def auc(x, y, reorder=False):
 
     See also
     --------
-    roc_auc_score : Computes the area under the ROC curve
+    roc_auc_score : Compute the area under the ROC curve
+    average_precision_score : Compute average precision from prediction scores
     precision_recall_curve :
         Compute precision-recall pairs for different probability thresholds
     """
@@ -108,6 +111,19 @@ def average_precision_score(y_true, y_score, average="macro",
                              sample_weight=None):
     """Compute average precision (AP) from prediction scores
 
+    AP summarizes a precision-recall curve as the weighted mean of precisions
+    achieved at each threshold, with the increase in recall from the previous
+    threshold used as the weight:
+
+    .. math::
+        \\text{AP} = \\sum_n (R_n - R_{n-1}) P_n
+
+    where :math:`P_n` and :math:`R_n` are the precision and recall at the nth
+    threshold [1]_. This implementation is not interpolated and is different
+    from computing the area under the precision-recall curve with the
+    trapezoidal rule, which uses linear interpolation and can be too
+    optimistic.
+
     Note: this implementation is restricted to the binary classification task
     or multilabel classification task.
 
@@ -149,17 +165,12 @@ def average_precision_score(y_true, y_score, average="macro",
     References
     ----------
     .. [1] `Wikipedia entry for the Average precision
-           <http://en.wikipedia.org/wiki/Average_precision>`_
-    .. [2] `Stanford Information Retrieval book
-           <http://nlp.stanford.edu/IR-book/html/htmledition/
-           evaluation-of-ranked-retrieval-results-1.html>`_
-    .. [3] `The PASCAL Visual Object Classes (VOC) Challenge
-           <http://citeseerx.ist.psu.edu/viewdoc/
-           download?doi=10.1.1.157.5766&rep=rep1&type=pdf>`_
+           <http://en.wikipedia.org/w/index.php?title=Information_retrieval&
+           oldid=793358396#Average_precision>`_
 
     See also
     --------
-    roc_auc_score : Area under the ROC curve
+    roc_auc_score : Compute the area under the ROC curve
 
     precision_recall_curve :
         Compute precision-recall pairs for different probability thresholds
@@ -190,7 +201,8 @@ def _binary_uninterpolated_average_precision(
 
 
 def roc_auc_score(y_true, y_score, average="macro", sample_weight=None):
-    """Compute Area Under the Curve (AUC) from prediction scores
+    """Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC)
+    from prediction scores.
 
     Note: this implementation is restricted to the binary classification task
     or multilabel classification task in label indicator format.
@@ -239,7 +251,7 @@ def roc_auc_score(y_true, y_score, average="macro", sample_weight=None):
     --------
     average_precision_score : Area under the precision-recall curve
 
-    roc_curve : Compute Receiver operating characteristic (ROC)
+    roc_curve : Compute Receiver operating characteristic (ROC) curve
 
     Examples
     --------
@@ -396,6 +408,12 @@ def precision_recall_curve(y_true, probas_pred, pos_label=None,
         Increasing thresholds on the decision function used to compute
         precision and recall.
 
+    See also
+    --------
+    average_precision_score : Compute average precision from prediction scores
+
+    roc_curve : Compute Receiver operating characteristic (ROC) curve
+
     Examples
     --------
     >>> import numpy as np
@@ -477,7 +495,7 @@ def roc_curve(y_true, y_score, pos_label=None, sample_weight=None,
 
     See also
     --------
-    roc_auc_score : Compute Area Under the Curve (AUC) from prediction scores
+    roc_auc_score : Compute the area under the ROC curve
 
     Notes
     -----
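The ``See also`` entries added above cross-link the ranking metrics to one another. A minimal sketch, with made-up data, showing the cross-referenced functions applied to the same scores:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

# Made-up labels and scores, only to exercise the cross-referenced functions.
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Curves: operating points for the PR and ROC curves.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_scores)
fpr, tpr, roc_thresholds = roc_curve(y_true, y_scores)

# Scalar summaries: AP for the precision-recall curve, ROC AUC for the ROC curve.
print("AP     :", average_precision_score(y_true, y_scores))
print("ROC AUC:", roc_auc_score(y_true, y_scores))
```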
