[WIP] Eleven point average precision by GaelVaroquaux · Pull Request #9091 · scikit-learn/scikit-learn · GitHub

[WIP] Eleven point average precision #9091

Closed
2 changes: 1 addition & 1 deletion doc/modules/model_evaluation.rst
@@ -681,7 +681,7 @@ Here are some small examples in binary classification::
>>> threshold
array([ 0.35, 0.4 , 0.8 ])
>>> average_precision_score(y_true, y_scores) # doctest: +ELLIPSIS
- 0.79...
0.83...
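
The updated value can be reproduced from the new step-wise definition, and the old value from the previous linear interpolation (a trapezoidal rule over the same operating points). A minimal check, assuming the ``y_true`` and ``y_scores`` arrays defined earlier in this doc example:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, y_scores)
# points come back in order of decreasing recall, hence the sign flip
# when applying AP = sum_n (R_n - R_{n-1}) P_n
ap_step = -np.sum(np.diff(recall) * precision[:-1])  # 0.8333...
# the old behaviour linearly interpolated between operating points,
# which amounts to a trapezoidal rule on the same curve
ap_interp = -np.trapz(precision, recall)             # 0.7916...
print(ap_step, ap_interp, average_precision_score(y_true, y_scores))
```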



9 changes: 8 additions & 1 deletion doc/whats_new.rst
@@ -6,7 +6,7 @@ Release history
===============

Version 0.19
- ============
==============

**In Development**

@@ -193,6 +193,13 @@ Enhancements
Bug fixes
.........

- :func:`metrics.ranking.average_precision_score` no longer linearly
interpolates between operating points, and instead weighs precisions
by the change in recall since the last operating point, as per the
`Wikipedia entry <http://en.wikipedia.org/wiki/Average_precision>`_.
(`#7356 <https://github.com/scikit-learn/scikit-learn/pull/7356>`_). By
`Nick Dingwall`_ and `Gael Varoquaux`_.

- Fixed a bug in :class:`sklearn.covariance.MinCovDet` where inputting data
that produced a singular covariance matrix would cause the helper method
`_c_step` to throw an exception.
271 changes: 218 additions & 53 deletions examples/model_selection/plot_precision_recall.py
@@ -5,13 +5,18 @@

Example of Precision-Recall metric to evaluate classifier output quality.

- In information retrieval, precision is a measure of result relevancy, while
- recall is a measure of how many truly relevant results are returned. A high
- area under the curve represents both high recall and high precision, where high
- precision relates to a low false positive rate, and high recall relates to a
- low false negative rate. High scores for both show that the classifier is
- returning accurate results (high precision), as well as returning a majority of
- all positive results (high recall).
Precision-Recall is a useful measure of success of prediction when the
classes are very imbalanced. In information retrieval, precision is a
measure of result relevancy, while recall is a measure of how many truly
relevant results are returned.

The precision-recall curve shows the tradeoff between precision and
recall for different thresholds. A high area under the curve represents
both high recall and high precision, where high precision relates to a
low false positive rate, and high recall relates to a low false negative
rate. High scores for both show that the classifier is returning accurate
results (high precision), as well as returning a majority of all positive
results (high recall).

A system with high recall but low precision returns many results, but most of
its predicted labels are incorrect when compared to the training labels. A
@@ -37,7 +42,7 @@

:math:`F1 = 2\\frac{P \\times R}{P+R}`

- It is important to note that the precision may not decrease with recall. The
Note that the precision may not decrease with recall. The
definition of precision (:math:`\\frac{T_p}{T_p + F_p}`) shows that lowering
the threshold of a classifier may increase the denominator, by increasing the
number of results returned. If the threshold was previously set too high, the
@@ -54,11 +59,20 @@
The relationship between recall and precision can be observed in the
stairstep area of the plot - at the edges of these steps a small change
in the threshold considerably reduces precision, with only a minor gain in
- recall. See the corner at recall = .59, precision = .8 for an example of this
- phenomenon.
recall.

**Average precision** summarizes such a plot as the weighted mean of precisions
achieved at each threshold, with the increase in recall from the previous
threshold used as the weight:

:math:`\\text{AP} = \\sum_n (R_n - R_{n-1}) P_n`

where :math:`P_n` and :math:`R_n` are the precision and recall at the
nth threshold. A pair :math:`(R_k, P_k)` is referred to as an
*operating point*.

Precision-recall curves are typically used in binary classification to study
- the output of a classifier. In order to extend Precision-recall curve and
the output of a classifier. In order to extend the precision-recall curve and
average precision to multi-class or multi-label classification, it is necessary
to binarize the output. One curve can be drawn per label, but one can also draw
a precision-recall curve by considering each element of the label indicator
@@ -71,76 +85,148 @@
:func:`sklearn.metrics.precision_score`,
:func:`sklearn.metrics.f1_score`
"""
- print(__doc__)

- import matplotlib.pyplot as plt
- import numpy as np
- from itertools import cycle
from __future__ import print_function

###############################################################################
# In binary classification settings
# --------------------------------------------------------
#
# Create simple data
# ..................
#
# Try to differentiate the first two classes of the iris data
from sklearn import svm, datasets
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
import numpy as np

# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

- # setup plot details
- colors = cycle(['navy', 'turquoise', 'darkorange', 'cornflowerblue', 'teal'])
- lw = 2

- # Binarize the output
- y = label_binarize(y, classes=[0, 1, 2])
- n_classes = y.shape[1]

# Add noisy features
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Limit to the first two classes, and split into training and test
X_train, X_test, y_train, y_test = train_test_split(X[y < 2], y[y < 2],
                                                    test_size=.5,
                                                    random_state=random_state)

# Create a simple classifier
classifier = svm.LinearSVC(random_state=random_state)
classifier.fit(X_train, y_train)
y_score = classifier.decision_function(X_test)

###############################################################################
# Compute the average precision score
# ...................................
from sklearn.metrics import average_precision_score
average_precision = average_precision_score(y_test, y_score)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision))

###############################################################################
# Plot the Precision-Recall curve
# ................................
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

precision, recall, _ = precision_recall_curve(y_test, y_score)

plt.step(recall, precision, color='b', alpha=0.2,
         where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2,
                 color='b')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve: AUC={0:0.2f}'.format(
          average_precision))

###############################################################################
# In multi-label settings
# ------------------------
#
# Create multi-label data, fit, and predict
# ...........................................
#
# We create a multi-label dataset, to illustrate precision-recall in
# multi-label settings

from sklearn.preprocessing import label_binarize

# Use label_binarize to create multi-label-like settings
Y = label_binarize(y, classes=[0, 1, 2])
n_classes = Y.shape[1]

# Split into training and test
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.5,
                                                    random_state=random_state)

# We use OneVsRestClassifier for multi-label prediction
from sklearn.multiclass import OneVsRestClassifier

# Run classifier
- classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
-                                          random_state=random_state))
- y_score = classifier.fit(X_train, y_train).decision_function(X_test)
classifier = OneVsRestClassifier(svm.LinearSVC(random_state=random_state))
classifier.fit(X_train, Y_train)
y_score = classifier.decision_function(X_test)


- # Compute Precision-Recall and plot curve
###############################################################################
# The average precision score in multi-label settings
# ....................................................
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score

# For each class
precision = dict()
recall = dict()
average_precision = dict()
for i in range(n_classes):
-     precision[i], recall[i], _ = precision_recall_curve(y_test[:, i],
    precision[i], recall[i], _ = precision_recall_curve(Y_test[:, i],
                                                        y_score[:, i])
-     average_precision[i] = average_precision_score(y_test[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(Y_test[:, i], y_score[:, i])

- # Compute micro-average ROC curve and ROC area
- precision["micro"], recall["micro"], _ = precision_recall_curve(y_test.ravel(),
# A "micro-average": quantifying score on all classes jointly
precision["micro"], recall["micro"], _ = precision_recall_curve(Y_test.ravel(),
                                                                y_score.ravel())
- average_precision["micro"] = average_precision_score(y_test, y_score,
average_precision["micro"] = average_precision_score(Y_test, y_score,
                                                     average="micro")
print('Average precision score, micro-averaged over all classes: {0:0.2f}'
      .format(average_precision["micro"]))
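
As an aside, the micro-average pools every (sample, class) pair into one binary problem, so it coincides with the binary score computed on the flattened label indicator matrix. A quick consistency check, reusing ``Y_test``, ``y_score`` and ``average_precision`` from above:

```python
# micro-averaged AP equals binary AP on the raveled indicator matrix
ap_flat = average_precision_score(Y_test.ravel(), y_score.ravel())
assert np.isclose(average_precision["micro"], ap_flat)
```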

###############################################################################
# Plot the micro-averaged Precision-Recall curve
# ...............................................
#

plt.figure()
plt.step(recall['micro'], precision['micro'], color='b', alpha=0.2,
         where='post')
plt.fill_between(recall["micro"], precision["micro"], step='post', alpha=0.2,
                 color='b')

- # Plot Precision-Recall curve
- plt.clf()
- plt.plot(recall[0], precision[0], lw=lw, color='navy',
-          label='Precision-Recall curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
- plt.title('Precision-Recall example: AUC={0:0.2f}'.format(average_precision[0]))
- plt.legend(loc="lower left")
- plt.show()
plt.title(
    'Average precision score, micro-averaged over all classes: AUC={0:0.2f}'
    .format(average_precision["micro"]))

###############################################################################
# Plot Precision-Recall curve for each class and iso-f1 curves
- plt.clf()
# .............................................................
#
from itertools import cycle
# setup plot details
colors = cycle(['navy', 'turquoise', 'darkorange', 'cornflowerblue', 'teal'])

plt.figure(figsize=(7, 8))
f_scores = np.linspace(0.2, 0.8, num=4)
lines = []
labels = []
@@ -152,23 +238,102 @@

lines.append(l)
labels.append('iso-f1 curves')
l, = plt.plot(recall["micro"], precision["micro"], color='gold', lw=lw)
l, = plt.plot(recall["micro"], precision["micro"], color='gold', lw=2)
lines.append(l)
labels.append('micro-average Precision-recall curve (area = {0:0.2f})'
labels.append('micro-average Precision-recall (area = {0:0.2f})'
''.format(average_precision["micro"]))

for i, color in zip(range(n_classes), colors):
-     l, = plt.plot(recall[i], precision[i], color=color, lw=lw)
    l, = plt.plot(recall[i], precision[i], color=color, lw=2)
    lines.append(l)
-     labels.append('Precision-recall curve of class {0} (area = {1:0.2f})'
    labels.append('Precision-recall for class {0} (area = {1:0.2f})'
                  ''.format(i, average_precision[i]))

- fig = plt.gcf()
- fig.set_size_inches(7, 7)
- fig.subplots_adjust(bottom=0.25)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Extension of Precision-Recall curve to multi-class')
- plt.figlegend(lines, labels, loc='lower center')
plt.legend(lines, labels, loc=(0, -.38), prop=dict(size=14))
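
The hunk above elides the code that draws the iso-f1 curves. For reference, such contours follow from fixing F in F = 2*P*R / (P + R) and solving for precision, which gives P = F*R / (2*R - F). A sketch of one way to draw them (a minimal illustration, not necessarily the code hidden by the fold):

```python
import numpy as np
import matplotlib.pyplot as plt

# one gray contour of constant F1 per value in f_scores
for f_score in np.linspace(0.2, 0.8, num=4):
    x = np.linspace(0.01, 1)
    y = f_score * x / (2 * x - f_score)
    # keep only the branch above the pole at R = F/2, where P is positive
    plt.plot(x[y >= 0], y[y >= 0], color='gray', alpha=0.2)
```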


###############################################################################
# Eleven-point average precision
# ------------------------------
#
# In *interpolated* average precision, a set of desired recall values is
# specified and for each desired value we average the best precision
# scores possible with a recall value at least equal to the target value.
# The most common choice is 'eleven point' interpolated precision, where
# the desired recall values are [0, 0.1, 0.2, ..., 1.0]. This is the
# metric referenced in `The PASCAL Visual Object Classes (VOC) Challenge
# <http://citeseerx.ist.psu.edu
# /viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf>`_ (top of page
# 11, formula 1). In the example below, the eleven precision values are
# indicated with an arrow pointing to the best precision possible
# while meeting or exceeding the desired recall. Note that it's possible
# that the same operating point might correspond to multiple desired
# recall values.

from operator import itemgetter


def pick_eleven_points(recall_, precision_):
"""Choose the eleven operating points that correspond
to the best precision for any ``recall >= r`` for r in
[0, 0.1, 0.2, ..., 1.0]
"""
operating_points = list()
for target_recall in np.arange(0, 1.1, 0.1):
operating_points_to_consider = [pair
for pair in zip(recall_, precision_)
if pair[0] >= target_recall]
operating_points.append(max(operating_points_to_consider,
key=itemgetter(1)))
return operating_points

# Work on the 2nd class of iris
iris_cls = 2

eleven_points = pick_eleven_points(recall[iris_cls], precision[iris_cls])
interpolated_average_precision = np.mean([e[1] for e in eleven_points])


print("Target recall Selected recall Precision")
for i in range(11):
    print(" >= {} {: >12.3f} {: >12.3f}".format(i / 10.,
                                                *eleven_points[i]))

print(" Average:{: >22.3f}".format(interpolated_average_precision))

###############################################################################
# Plot illustrating eleven-point average precision
# .................................................

plt.figure(figsize=(7, 7))
plt.step(recall[iris_cls], precision[iris_cls], color='g', where='post',
         alpha=0.5, linewidth=2,
         label='Precision-recall curve of class {0} (area = {1:0.2f})'
               ''.format(iris_cls, average_precision[iris_cls]))

plt.fill_between(recall[iris_cls], precision[iris_cls], step='post', alpha=0.1,
                 color='g')
for i in range(11):
    plt.annotate('',
                 xy=(eleven_points[i][0], eleven_points[i][1]),
                 xycoords='data', xytext=(i / 10., 0), textcoords='data',
                 arrowprops=dict(arrowstyle="->", alpha=0.7,
                                 connectionstyle="angle3,angleA=90,angleB=45"))


plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xticks(np.arange(0, 1.1, 0.1))
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Eleven point Precision Recall for class\n {}'.format(iris_cls))
plt.legend(loc="upper right")

plt.show()