Precision Recall numbers computed by Scikits are not interpolated (non-standard) #4577


Closed
karpathy opened this issue Apr 12, 2015 · 24 comments

@karpathy

Hi,

Scikit Learn seems to implement Precision Recall curves (and Average Precision values/AUC under PR curve) in a non-standard way, without documenting the discrepancy. The standard way of computing Precision Recall numbers is by interpolating the curve, as described here:
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
The motivation is twofold:

  1. Smooth out the kinks and reduce noise contribution to the score
  2. In any practical application, if your PR curve ever went up, then you would strictly prefer to set your threshold there rather than at the original place (achieving both more precision and recall). Hence, people prefer to interpolate the curve, which better integrates out the threshold parameter and gives a more sensible estimate of the real performance.

This is also what the standard Pascal VOC code does, and they explain it in their writeup:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf

VL_FEAT also has options for interpolation:
http://www.vlfeat.org/matlab/vl_pr.html
and as shown in their code here: https://github.com/vlfeat/vlfeat/blob/edc378a722ea0d79e29f4648a54bb62f32b22568/toolbox/plotop/vl_pr.m

The concern is that people using the scikit version will see incorrectly reported LOWER performance than what they might see reported in other papers.
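
To make the discrepancy concrete, here is a small comparison sketch (the toy labels and scores are made up for illustration; average_precision_score and precision_recall_curve are actual scikit-learn functions, the rest is a rough VOC-style computation):

import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy data, purely illustrative
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3])

ap_sklearn = average_precision_score(y_true, y_score)  # the value scikit-learn reports

precision, recall, _ = precision_recall_curve(y_true, y_score)
# recall is returned in decreasing order, so this running max takes, for each
# point, the best precision achievable at an equal or higher recall.
p_interp = np.maximum.accumulate(precision)
ap_voc_style = np.sum(-np.diff(recall) * p_interp[:-1])
# ap_voc_style >= ap_sklearn: the interpolation can only raise the score.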

@jnothman
Member

"Non-standard" might depend on the particular field of application. I think it would be worth having an interpolation mode. Could you also comment on #4223?

@karpathy
Author

RE "Non-standard" granted. More precisely, the community-agreed-on gold standard at least in the chunk of Machine Learning and Computer Vision that I work in is the Pascal VOC evaluation code.

The Pascal VOC code is in Matlab, and works as follows. First, in VOCevalaction.m we have:

% compute precision/recall
[so,si]=sort(-out); % out are class confidences
tp=gt(si)>0; % true positives
fp=gt(si)<0; % false positives
fp=cumsum(fp);
tp=cumsum(tp);
rec=tp/sum(gt>0); % recalls
prec=tp./(fp+tp); % precisions

and then they call ap=VOCap(rec,prec);, which we can descend into:

function ap = VOCap(rec,prec)
mrec=[0 ; rec ; 1];   % pad recall with 0 and 1
mpre=[0 ; prec ; 0];  % pad precision with 0 at both ends
for i=numel(mpre)-1:-1:1
    mpre(i)=max(mpre(i),mpre(i+1));  % running max from the right (interpolation)
end
i=find(mrec(2:end)~=mrec(1:end-1))+1;  % indices where recall changes
ap=sum((mrec(i)-mrec(i-1)).*mpre(i));  % integrate precision over recall

In particular note the padding with 0 and 1, and the interpolation (loop with max). The last part integrates the precision over the recalls.
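
For readers more comfortable with NumPy, here is a rough translation of VOCap (a sketch, not scikit-learn code; rec and prec are assumed to come from a computation like the one above):

import numpy as np

def voc_ap(rec, prec):
    """Sketch of Pascal VOC-style interpolated average precision."""
    # Pad so the curve starts at recall 0 and ends at recall 1.
    mrec = np.concatenate(([0.0], rec, [1.0]))
    mpre = np.concatenate(([0.0], prec, [0.0]))
    # Interpolation: running max from the right makes precision non-increasing.
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    # Integrate precision over the points where recall changes.
    idx = np.where(mrec[1:] != mrec[:-1])[0] + 1
    return np.sum((mrec[idx] - mrec[idx - 1]) * mpre[idx])

The running-max envelope is exactly the step that makes the interpolated score larger than (or equal to) the raw one.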

@glouppe
Contributor
glouppe commented Apr 12, 2015

+1 for documenting and/or adding this variant. This is the kind of hidden pitfall that users should be made more aware of.

@amueller
Member

+1 that probably makes the numbers in my papers better ^^

@arjoly
Member
arjoly commented Jul 8, 2015

Are we sure that this is the correct way to interpolate the PR curve?

I thought that the interpolation has to be done first in ROC space and then translated into PR space, as there is a one-to-one mapping between those curves. This is highlighted in this paper.

@chiragnagpal

We could have multiple values for the interpolate parameter; that way we can add newer interpolation techniques.

@arjoly
Member
arjoly commented Jul 10, 2015

ping @amueller, @glouppe @jnothman

@amueller
Member

Having multiple strategies is probably a good idea. I think VOC and vl_feat are pretty good references, but that is because I'm from the computer vision community. Other communities might have different conventions, and interpolating AUC in the precision/recall space does seem a bit odd.

@amueller
Member

[I haven't really put any thought into it though]

@amueller
Member

The IR book also seems to interpolate the precision/recall curve, so I think that should be the right thing to do? http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html

@agramfort
Member

fair enough

@amueller
Member
amueller commented Apr 29, 2016

Actually, the Pascal score is just pretty wrong:
http://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf

We can implement it as an option, but we should probably do the achievable interpolation by default, as explained in the paper above.
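
For context, a minimal sketch of that "achievable" interpolation, assuming we know the true/false positive counts at two adjacent operating points (the function name and signature are made up for illustration): the counts are interpolated linearly, which yields a non-linear segment in PR space.

import numpy as np

def pr_segment_davis_goadrich(tp_a, fp_a, tp_b, fp_b, n_pos):
    """Sketch: achievable PR points between operating points A and B
    (Davis & Goadrich, 2006)."""
    n_steps = max(int(abs(tp_b - tp_a)), 1)  # roughly one point per extra true positive
    t = np.linspace(0.0, 1.0, n_steps + 1)
    tp = tp_a + t * (tp_b - tp_a)            # interpolate counts, not precision/recall
    fp = fp_a + t * (fp_b - fp_a)
    recall = tp / n_pos
    precision = tp / np.maximum(tp + fp, 1e-12)  # guard against 0/0
    return recall, precision

Interpolating linearly between (recall, precision) pairs instead would overestimate the area, which is the paper's argument against the trapezoidal rule in PR space.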

@redst4r
redst4r commented Aug 5, 2016

As a side note, AUC for precision-recall curves is not a good measure of performance anyway, as argued in this NIPS paper by Peter Flach.

@amueller
Member
amueller commented Sep 8, 2016

@karpathy do you know of a defence of the smoothing in the IR book and Pascal? It assumes you know which thresholds are good on the test set. You could pick thresholds on the validation set and then apply them to the test set, but picking the thresholds on the test set seems like a pretty bad error.

@amueller
Member
amueller commented Sep 8, 2016

So I think we should change the current behavior of average_precision to match Wikipedia (which is not what you implemented), add an interpolate option, and not allow linear interpolation.
The definition of average precision is not really the area under the empirical precision-recall curve; it's the average of precision@k over the ranks k at which relevant items are retrieved. We should implement it like that; it's quite different from what we currently do.
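
A minimal sketch of that rank-based definition (illustrative code, not the scikit-learn implementation):

import numpy as np

def average_precision_rank_based(y_true, y_score):
    """Sketch: AP as the mean of precision@k over the ranks k of relevant items."""
    order = np.argsort(-np.asarray(y_score))  # rank items by decreasing score
    hits = np.asarray(y_true)[order] > 0      # relevance in ranked order
    k = np.arange(1, hits.size + 1)
    precision_at_k = np.cumsum(hits) / k
    return precision_at_k[hits].mean()        # average only at ranks of relevant items

This equals the step-wise sum of precision weighted by the change in recall, with no interpolation between points.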

@amueller amueller modified the milestone: 0.19 Sep 29, 2016
@brendano
brendano commented Feb 2, 2017

FYI, we just came across these issues a month ago and had to read through the implementation to see what was going on. I think it's dangerous for the documentation to reference a Wikipedia page to define something, since Wikipedia can change. Looking through the docs, I personally found it very difficult to understand what the function was doing.

The current sklearn implementation for average_precision seems to be calculating a trapezoidal area under the PR curve. Since that doesn't seem to be identical to at least the definition of AP I've seen, it might help to have an equation in the documentation saying what the implementation does.

(Obviously there are many different definitions of these things so it's not like one correct method can be chosen.)
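
To make that distinction explicit, a sketch of the two quantities (illustrative helper names; recall is assumed sorted in increasing order, with precision aligned):

import numpy as np

def trapezoidal_area(recall, precision):
    # area under the curve with linear interpolation between points
    return np.trapz(precision, recall)

def step_wise_average_precision(recall, precision):
    # precision weighted by the increase in recall, with no interpolation
    return np.sum(np.diff(recall, prepend=0.0) * precision)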

@GaelVaroquaux
Member

Fixed by #9017

@amueller
Member

@GaelVaroquaux I don't think this was fixed by your PR, as it neither implemented the "interpolated" nor the "11 point interpolated" version.

@amueller amueller reopened this Jun 10, 2017
@amueller
Member

On the off-chance that @karpathy is reading this: the IR book doesn't use interpolation, from what I understand. vl_feat uses non-interpolated average precision by default, from the code you linked to.
The Pascal VOC code uses "interpolated" AP, while their paper says they use 11-point. We should probably add options for those two, but I don't think we should use either by default.

@GaelVaroquaux
Member
GaelVaroquaux commented Jun 10, 2017 via email

@amueller
Member

The NIPS paper linked above is quite nice and argues again that the VOC measure makes no sense ^^

@agitter
Contributor
agitter commented Aug 17, 2017

I'm pleased to see that average_precision_score no longer linearly interpolates between operating points, but I found the updated documentation confusing. The referenced Stanford IR book and VOC paper implement interpolated versions that differ from the sklearn implementation. Adding the average precision formula as @brendano suggested or removing these references would make it easier to see what the function computes without looking at the source.

@jnothman
Member
jnothman commented Aug 17, 2017 via email

@adrinjalali
Member

#9583 solves the last issues here IIUC.
