[WIP] Eleven point average precision by GaelVaroquaux · Pull Request #9091 · scikit-learn/scikit-learn · GitHub

[WIP] Eleven point average precision #9091

Closed
2 changes: 1 addition & 1 deletion doc/modules/model_evaluation.rst
@@ -681,7 +681,7 @@ Here are some small examples in binary classification::
>>> threshold
array([ 0.35, 0.4 , 0.8 ])
>>> average_precision_score(y_true, y_scores) # doctest: +ELLIPSIS
- 0.79...
0.83...
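
The updated value can be reproduced from the new step-wise definition, and the old value from the previous linear interpolation (a trapezoidal rule over the same operating points). A minimal check, assuming the ``y_true`` and ``y_scores`` arrays defined earlier in this doc example:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, y_scores)
# points come back in order of decreasing recall, hence the sign flip
# when applying AP = sum_n (R_n - R_{n-1}) P_n
ap_step = -np.sum(np.diff(recall) * precision[:-1])  # 0.8333...
# the old behaviour linearly interpolated between operating points,
# which amounts to a trapezoidal rule on the same curve
ap_interp = -np.trapz(precision, recall)             # 0.7916...
print(ap_step, ap_interp, average_precision_score(y_true, y_scores))
```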



9 changes: 8 additions & 1 deletion doc/whats_new.rst
@@ -6,7 +6,7 @@ Release history
===============

Version 0.19
- ============
==============

**In Development**

@@ -193,6 +193,13 @@ Enhancements
Bug fixes
.........

- :func:`metrics.ranking.average_precision_score` no longer linearly
interpolates between operating points, and instead weighs precisions
by the change in recall since the last operating point, as per the
`Wikipedia entry <http://en.wikipedia.org/wiki/Average_precision>`_.
(`#7356 <https://github.com/scikit-learn/scikit-learn/pull/7356>`_). By
`Nick Dingwall`_ and `Gael Varoquaux`_.

- Fixed a bug in :class:`sklearn.covariance.MinCovDet` where inputting data
that produced a singular covariance matrix would cause the helper method
`_c_step` to throw an exception.
271 changes: 218 additions & 53 deletions examples/model_selection/plot_precision_recall.py
@@ -5,13 +5,18 @@

Example of Precision-Recall metric to evaluate classifier output quality.

- In information retrieval, precision is a measure of result relevancy, while
- recall is a measure of how many truly relevant results are returned. A high
- area under the curve represents both high recall and high precision, where high
- precision relates to a low false positive rate, and high recall relates to a
- low false negative rate. High scores for both show that the classifier is
- returning accurate results (high precision), as well as returning a majority of
- all positive results (high recall).
Precision-Recall is a useful measure of success of prediction when the
classes are very imbalanced. In information retrieval, precision is a
measure of result relevancy, while recall is a measure of how many truly
relevant results are returned.

The precision-recall curve shows the tradeoff between precision and
recall for different thresholds. A high area under the curve represents
both high recall and high precision, where high precision relates to a
low false positive rate, and high recall relates to a low false negative
rate. High scores for both show that the classifier is returning accurate
results (high precision), as well as returning a majority of all positive
results (high recall).

A system with high recall but low precision returns many results, but most of
its predicted labels are incorrect when compared to the training labels. A
@@ -37,7 +42,7 @@

:math:`F1 = 2\\frac{P \\times R}{P+R}`

- It is important to note that the precision may not decrease with recall. The
Note that the precision may not decrease with recall. The
definition of precision (:math:`\\frac{T_p}{T_p + F_p}`) shows that lowering
the threshold of a classifier may increase the denominator, by increasing the
number of results returned. If the threshold was previously set too high, the
@@ -54,11 +59,20 @@
The relationship between recall and precision can be observed in the
stairstep area of the plot - at the edges of these steps a small change
in the threshold considerably reduces precision, with only a minor gain in
- recall. See the corner at recall = .59, precision = .8 for an example of this
- phenomenon.
recall.

**Average precision** summarizes such a plot as the weighted mean of precisions
achieved at each threshold, with the increase in recall from the previous
threshold used as the weight:

:math:`\\text{AP} = \\sum_n (R_n - R_{n-1}) P_n`

where :math:`P_n` and :math:`R_n` are the precision and recall at the
nth threshold. A pair :math:`(R_k, P_k)` is referred to as an
*operating point*.

Precision-recall curves are typically used in binary classification to study
- the output of a classifier. In order to extend Precision-recall curve and
the output of a classifier. In order to extend the precision-recall curve and
average precision to multi-class or multi-label classification, it is necessary
to binarize the output. One curve can be drawn per label, but one can also draw
a precision-recall curve by considering each element of the label indicator
@@ -71,76 +85,148 @@
:func:`sklearn.metrics.precision_score`,
:func:`sklearn.metrics.f1_score`
"""
- print(__doc__)

- import matplotlib.pyplot as plt
- import numpy as np
- from itertools import cycle
from __future__ import print_function

###############################################################################
# In binary classification settings
# --------------------------------------------------------
#
# Create simple data
# ..................
#
# Try to differentiate the first two classes of the iris data
from sklearn import svm, datasets
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
import numpy as np

# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

- # setup plot details
- colors = cycle(['navy', 'turquoise', 'darkorange', 'cornflowerblue', 'teal'])
- lw = 2

- # Binarize the output
- y = label_binarize(y, classes=[0, 1, 2])
- n_classes = y.shape[1]

# Add noisy features
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Limit to the first two classes, and split into training and test
X_train, X_test, y_train, y_test = train_test_split(X[y < 2], y[y < 2],
                                                    test_size=.5,
                                                    random_state=random_state)

# Create a simple classifier
classifier = svm.LinearSVC(random_state=random_state)
classifier.fit(X_train, y_train)
y_score = classifier.decision_function(X_test)

###############################################################################
# Compute the average precision score
# ...................................
from sklearn.metrics import average_precision_score
average_precision = average_precision_score(y_test, y_score)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision))

###############################################################################
# Plot the Precision-Recall curve
# ................................
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

precision, recall, _ = precision_recall_curve(y_test, y_score)

plt.step(recall, precision, color='b', alpha=0.2,
         where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2,
                 color='b')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve: AUC={0:0.2f}'.format(
          average_precision))

###############################################################################
# In multi-label settings
# ------------------------
#
# Create multi-label data, fit, and predict
# ...........................................
#
# We create a multi-label dataset, to illustrate precision-recall in
# multi-label settings

from sklearn.preprocessing import label_binarize

# Use label_binarize to create multi-label-like settings
Y = label_binarize(y, classes=[0, 1, 2])
n_classes = Y.shape[1]

# Split into training and test
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.5,
                                                    random_state=random_state)

# We use OneVsRestClassifier for multi-label prediction
from sklearn.multiclass import OneVsRestClassifier

# Run classifier
- classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
-                                          random_state=random_state))
- y_score = classifier.fit(X_train, y_train).decision_function(X_test)
classifier = OneVsRestClassifier(svm.LinearSVC(random_state=random_state))
classifier.fit(X_train, Y_train)
y_score = classifier.decision_function(X_test)


- # Compute Precision-Recall and plot curve
###############################################################################
# The average precision score in multi-label settings
# ....................................................
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score

# For each class
precision = dict()
recall = dict()
average_precision = dict()
for i in range(n_classes):
-     precision[i], recall[i], _ = precision_recall_curve(y_test[:, i],
    precision[i], recall[i], _ = precision_recall_curve(Y_test[:, i],
                                                        y_score[:, i])
-     average_precision[i] = average_precision_score(y_test[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(Y_test[:, i], y_score[:, i])

- # Compute micro-average ROC curve and ROC area
- precision["micro"], recall["micro"], _ = precision_recall_curve(y_test.ravel(),
# A "micro-average": quantifying score on all classes jointly
precision["micro"], recall["micro"], _ = precision_recall_curve(Y_test.ravel(),
                                                                y_score.ravel())
- average_precision["micro"] = average_precision_score(y_test, y_score,
average_precision["micro"] = average_precision_score(Y_test, y_score,
                                                     average="micro")
print('Average precision score, micro-averaged over all classes: {0:0.2f}'
      .format(average_precision["micro"]))
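
As an aside, the micro-average pools every (sample, class) pair into one binary problem, so it coincides with the binary score computed on the flattened label indicator matrix. A quick consistency check, reusing ``Y_test``, ``y_score`` and ``average_precision`` from above:

```python
# micro-averaged AP equals binary AP on the raveled indicator matrix
ap_flat = average_precision_score(Y_test.ravel(), y_score.ravel())
assert np.isclose(average_precision["micro"], ap_flat)
```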

###############################################################################
# Plot the micro-averaged Precision-Recall curve
# ...............................................
#

plt.figure()
plt.step(recall['micro'], precision['micro'], color='b', alpha=0.2,
         where='post')
plt.fill_between(recall["micro"], precision["micro"], step='post', alpha=0.2,
                 color='b')

- # Plot Precision-Recall curve
- plt.clf()
- plt.plot(recall[0], precision[0], lw=lw, color='navy',
-          label='Precision-Recall curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
- plt.title('Precision-Recall example: AUC={0:0.2f}'.format(average_precision[0]))
- plt.legend(loc="lower left")
- plt.show()
plt.title(
    'Average precision score, micro-averaged over all classes: AUC={0:0.2f}'
    .format(average_precision["micro"]))

###############################################################################
# Plot Precision-Recall curve for each class and iso-f1 curves
- plt.clf()
# .............................................................
#
from itertools import cycle
# setup plot details
colors = cycle(['navy', 'turquoise', 'darkorange', 'cornflowerblue', 'teal'])

plt.figure(figsize=(7, 8))
f_scores = np.linspace(0.2, 0.8, num=4)
lines = []
labels = []
@@ -152,23 +238,102 @@

lines.append(l)
labels.append('iso-f1 curves')
l, = plt.plot(recall["micro"], precision["micro"], color='gold', lw=lw)
l, = plt.plot(recall["micro"], precision["micro"], color='gold', lw=2)
lines.append(l)
labels.append('micro-average Precision-recall curve (area = {0:0.2f})'
labels.append('micro-average Precision-recall (area = {0:0.2f})'
''.format(average_precision["micro"]))

for i, color in zip(range(n_classes), colors):
-     l, = plt.plot(recall[i], precision[i], color=color, lw=lw)
    l, = plt.plot(recall[i], precision[i], color=color, lw=2)
    lines.append(l)
-     labels.append('Precision-recall curve of class {0} (area = {1:0.2f})'
    labels.append('Precision-recall for class {0} (area = {1:0.2f})'
                  ''.format(i, average_precision[i]))

- fig = plt.gcf()
- fig.set_size_inches(7, 7)
- fig.subplots_adjust(bottom=0.25)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Extension of Precision-Recall curve to multi-class')
- plt.figlegend(lines, labels, loc='lower center')
plt.legend(lines, labels, loc=(0, -.38), prop=dict(size=14))
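
The hunk above elides the code that draws the iso-f1 curves. For reference, such contours follow from fixing F in F = 2*P*R / (P + R) and solving for precision, which gives P = F*R / (2*R - F). A sketch of one way to draw them (a minimal illustration, not necessarily the code hidden by the fold):

```python
import numpy as np
import matplotlib.pyplot as plt

# one gray contour of constant F1 per value in f_scores
for f_score in np.linspace(0.2, 0.8, num=4):
    x = np.linspace(0.01, 1)
    y = f_score * x / (2 * x - f_score)
    # keep only the branch above the pole at R = F/2, where P is positive
    plt.plot(x[y >= 0], y[y >= 0], color='gray', alpha=0.2)
```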


###############################################################################
# Eleven-point average precision
# ------------------------------
#
# In *interpolated* average precision, a set of desired recall values is
# specified and for each desired value we average the best precision
# scores possible with a recall value at least equal to the target value.
# The most common choice is 'eleven point' interpolated precision, where
# the desired recall values are [0, 0.1, 0.2, ..., 1.0]. This is the
# metric referenced in `The PASCAL Visual Object Classes (VOC) Challenge
# <http://citeseerx.ist.psu.edu
# /viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf>`_ (top of page
# 11, formula 1). In the example below, the eleven precision values are
# indicated with an arrow pointing to the best precision possible
# while meeting or exceeding the desired recall. Note that it's possible
# that the same operating point might correspond to multiple desired
# recall values.

from operator import itemgetter


def pick_eleven_points(recall_, precision_):
"""Choose the eleven operating points that correspond
to the best precision for any ``recall >= r`` for r in
[0, 0.1, 0.2, ..., 1.0]
"""
operating_points = list()
for target_recall in np.arange(0, 1.1, 0.1):
operating_points_to_consider = [pair
for pair in zip(recall_, precision_)
if pair[0] >= target_recall]
operating_points.append(max(operating_points_to_consider,
key=itemgetter(1)))
return operating_points

# Work on the 2nd class of iris
iris_cls = 2

eleven_points = pick_eleven_points(recall[iris_cls], precision[iris_cls])
interpolated_average_precision = np.mean([e[1] for e in eleven_points])


print("Target recall Selected recall Precision")
for i in range(11):
    print(" >= {} {: >12.3f} {: >12.3f}".format(i / 10.,
                                                *eleven_points[i]))

print(" Average:{: >22.3f}".format(interpolated_average_precision))

###############################################################################
# Plot illustrating eleven-point average precision
# .................................................

plt.figure(figsize=(7, 7))
plt.step(recall[iris_cls], precision[iris_cls], color='g', where='post',
         alpha=0.5, linewidth=2,
         label='Precision-recall curve of class {0} (area = {1:0.2f})'
               ''.format(iris_cls, average_precision[iris_cls]))

plt.fill_between(recall[iris_cls], precision[iris_cls], step='post', alpha=0.1,
                 color='g')
for i in range(11):
    plt.annotate('',
                 xy=(eleven_points[i][0], eleven_points[i][1]),
                 xycoords='data', xytext=(i / 10., 0), textcoords='data',
                 arrowprops=dict(arrowstyle="->", alpha=0.7,
                                 connectionstyle="angle3,angleA=90,angleB=45"))


plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xticks(np.arange(0, 1.1, 0.1))
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Eleven point Precision Recall for class\n {}'.format(iris_cls))
plt.legend(loc="upper right")

plt.show()