Precision Recall numbers computed by Scikits are not interpolated (non-standard) #4577


Closed
karpathy opened this issue Apr 12, 2015 · 24 comments

@karpathy

Hi,

Scikit Learn seems to implement Precision Recall curves (and Average Precision values/AUC under PR curve) in a non-standard way, without documenting the discrepancy. The standard way of computing Precision Recall numbers is by interpolating the curve, as described here:
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
The motivation is twofold:

  1. Smooth out the kinks and reduce noise contribution to the score
  2. In any practical application, if your PR curve ever went up, then you would strictly prefer to set your threshold there rather than at the original place (achieving both more precision and recall). Hence, people prefer to interpolate the curve, which better integrates out the threshold parameter and gives a more sensible estimate of the real performance.

This is also what the standard Pascal VOC code does, and they explain it in their writeup:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf

VL_FEAT also has options for interpolation:
http://www.vlfeat.org/matlab/vl_pr.html
and as shown in their code here: https://github.com/vlfeat/vlfeat/blob/edc378a722ea0d79e29f4648a54bb62f32b22568/toolbox/plotop/vl_pr.m

The concern is that people using the scikit version will see incorrectly reported LOWER performance than what they might see reported in other papers.
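
To make the discrepancy concrete, here is a small comparison sketch (the toy labels and scores are made up for illustration; average_precision_score and precision_recall_curve are actual scikit-learn functions, the rest is a rough VOC-style computation):

import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy data, purely illustrative
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3])

ap_sklearn = average_precision_score(y_true, y_score)  # the value scikit-learn reports

precision, recall, _ = precision_recall_curve(y_true, y_score)
# recall is returned in decreasing order, so this running max takes, for each
# point, the best precision achievable at an equal or higher recall.
p_interp = np.maximum.accumulate(precision)
ap_voc_style = np.sum(-np.diff(recall) * p_interp[:-1])
# ap_voc_style >= ap_sklearn: the interpolation can only raise the score.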

@jnothman
Member

"Non-standard" might depend on the particular field of application. I think it would be worth having an interpolation mode. Could you also comment on #4223?

@karpathy
Author

RE "Non-standard" granted. More precisely, the community-agreed-on gold standard at least in the chunk of Machine Learning and Computer Vision that I work in is the Pascal VOC evaluation code.

The Pascal VOC code is in Matlab, and works as follows. First, in VOCevalaction.m we have:

% compute precision/recall
[so,si]=sort(-out); % out are class confidences
tp=gt(si)>0; % true positives
fp=gt(si)<0; % false positives
fp=cumsum(fp);
tp=cumsum(tp);
rec=tp/sum(gt>0); % recalls
prec=tp./(fp+tp); % precisions

and then they call ap=VOCap(rec,prec);, which we can descend into:

function ap = VOCap(rec,prec)
mrec=[0 ; rec ; 1];   % pad recall with 0 and 1
mpre=[0 ; prec ; 0];  % pad precision with 0 at both ends
for i=numel(mpre)-1:-1:1
    mpre(i)=max(mpre(i),mpre(i+1));  % running max from the right (interpolation)
end
i=find(mrec(2:end)~=mrec(1:end-1))+1;  % indices where recall changes
ap=sum((mrec(i)-mrec(i-1)).*mpre(i));  % integrate precision over recall

In particular note the padding with 0 and 1, and the interpolation (loop with max). The last part integrates the precision over the recalls.
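
For readers more comfortable with NumPy, here is a rough translation of VOCap (a sketch, not scikit-learn code; rec and prec are assumed to come from a computation like the one above):

import numpy as np

def voc_ap(rec, prec):
    """Sketch of Pascal VOC-style interpolated average precision."""
    # Pad so the curve starts at recall 0 and ends at recall 1.
    mrec = np.concatenate(([0.0], rec, [1.0]))
    mpre = np.concatenate(([0.0], prec, [0.0]))
    # Interpolation: running max from the right makes precision non-increasing.
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    # Integrate precision over the points where recall changes.
    idx = np.where(mrec[1:] != mrec[:-1])[0] + 1
    return np.sum((mrec[idx] - mrec[idx - 1]) * mpre[idx])

The running-max envelope is exactly the step that makes the interpolated score larger than (or equal to) the raw one.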

@glouppe
Contributor
glouppe commented Apr 12, 2015

+1 for documenting and/or adding this variant. This is the kind of hidden pitfall that users should be made more aware of.

@amueller
Member

+1 that probably makes the numbers in my papers better ^^

@arjoly
Member
arjoly commented Jul 8, 2015

Are we sure that this is the correct way to interpolate the PR curve?

I thought that the interpolation has to be done first in ROC space and then translated into PR space, as there is a one-to-one mapping between those curves. This is highlighted in this paper.

@chiragnagpal

We could have multiple values for the interpolate parameter; that way we can add newer interpolation techniques.

@arjoly
Member
arjoly commented Jul 10, 2015

ping @amueller, @glouppe @jnothman

@amueller
Member

Having multiple strategies is probably a good idea. I think VOC and vl_feat are pretty good references, but that is because I'm from the computer vision community. Other communities might have different conventions, and interpolating AUC in the precision/recall space does seem a bit odd.

@amueller
Member

[I haven't really put any thought into it though]

@amueller
Member

The IR book also seems to interpolate the precision/recall curve, so I think that should be the right thing to do? http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html

@agramfort
Member

fair enough

@amueller
Member
amueller commented Apr 29, 2016

Actually, the Pascal score is just pretty wrong:
http://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf

We can implement it as an option, but we should probably do the achievable interpolation by default, as explained in the paper above.
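
For context, a minimal sketch of that "achievable" interpolation, assuming we know the true/false positive counts at two adjacent operating points (the function name and signature are made up for illustration): the counts are interpolated linearly, which yields a non-linear segment in PR space.

import numpy as np

def pr_segment_davis_goadrich(tp_a, fp_a, tp_b, fp_b, n_pos):
    """Sketch: achievable PR points between operating points A and B
    (Davis & Goadrich, 2006)."""
    n_steps = max(int(abs(tp_b - tp_a)), 1)  # roughly one point per extra true positive
    t = np.linspace(0.0, 1.0, n_steps + 1)
    tp = tp_a + t * (tp_b - tp_a)            # interpolate counts, not precision/recall
    fp = fp_a + t * (fp_b - fp_a)
    recall = tp / n_pos
    precision = tp / np.maximum(tp + fp, 1e-12)  # guard against 0/0
    return recall, precision

Interpolating linearly between (recall, precision) pairs instead would overestimate the area, which is the paper's argument against the trapezoidal rule in PR space.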

@redst4r
redst4r commented Aug 5, 2016

As a side note, AUC for precision-recall curves is not a good measure of performance anyway, as argued in this NIPS paper by Peter Flach.

@amueller
Member
amueller commented Sep 8, 2016

@karpathy do you know of a defence of the smoothing in the IR book and Pascal? It assumes you know which thresholds are good on the test set. You could pick thresholds on the validation set and then apply them to the test set, but picking the thresholds on the test set seems like a pretty bad error.

@amueller
Member
amueller commented Sep 8, 2016

So I think we should change the current behavior of average_precision to match Wikipedia (which is not what you implemented), add an interpolate option, and not allow linear interpolation.
The definition of average precision is not really the area under the empirical precision-recall curve; it's the average of precision@k over the ranks k at which relevant items are retrieved. We should implement it like that; it's quite different from what we currently do.
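
A minimal sketch of that rank-based definition (illustrative code, not the scikit-learn implementation):

import numpy as np

def average_precision_rank_based(y_true, y_score):
    """Sketch: AP as the mean of precision@k over the ranks k of relevant items."""
    order = np.argsort(-np.asarray(y_score))  # rank items by decreasing score
    hits = np.asarray(y_true)[order] > 0      # relevance in ranked order
    k = np.arange(1, hits.size + 1)
    precision_at_k = np.cumsum(hits) / k
    return precision_at_k[hits].mean()        # average only at ranks of relevant items

This equals the step-wise sum of precision weighted by the change in recall, with no interpolation between points.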

@amueller amueller modified the milestone: 0.19 Sep 29, 2016
@brendano
brendano commented Feb 2, 2017

FYI, we just came across these issues a month ago and had to read through the implementation to see what was going on. I think it's dangerous for the documentation to reference a Wikipedia page to define something, since Wikipedia can change. Looking through the docs, I personally found it very difficult to understand what the function was doing.

The current sklearn implementation for average_precision seems to be calculating a trapezoidal area under the PR curve. Since that doesn't seem to be identical to at least the definition of AP I've seen, it might help to have an equation in the documentation saying what the implementation does.

(Obviously there are many different definitions of these things so it's not like one correct method can be chosen.)
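
To make that distinction explicit, a sketch of the two quantities (illustrative helper names; recall is assumed sorted in increasing order, with precision aligned):

import numpy as np

def trapezoidal_area(recall, precision):
    # area under the curve with linear interpolation between points
    return np.trapz(precision, recall)

def step_wise_average_precision(recall, precision):
    # precision weighted by the increase in recall, with no interpolation
    return np.sum(np.diff(recall, prepend=0.0) * precision)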

@GaelVaroquaux
Member

Fixed by #9017

@amueller
Member

@GaelVaroquaux I don't think this was fixed by your PR, as it neither implemented the "interpolated" nor the "11 point interpolated" version.

@amueller amueller reopened this Jun 10, 2017
@amueller
Member

On the off-chance that @karpathy is reading this: the IR book doesn't use interpolation, from what I understand. vl_feat uses non-interpolated average precision by default, from the code you linked to.
The Pascal VOC code uses "interpolated" AP, while their paper says they use 11-point. We should probably add options for those two, but I don't think we should use either by default.

@GaelVaroquaux
Member
GaelVaroquaux commented Jun 10, 2017 via email

@amueller
Member

The NIPS paper linked above is quite nice and argues again that the VOC measure makes no sense ^^

@agitter
Contributor
agitter commented Aug 17, 2017

I'm pleased to see that average_precision_score no longer linearly interpolates between operating points, but I found the updated documentation confusing. The referenced Stanford IR book and VOC paper implement interpolated versions that differ from the sklearn implementation. Adding the average precision formula as @brendano suggested or removing these references would make it easier to see what the function computes without looking at the source.

@jnothman
Member
jnothman commented Aug 17, 2017 via email

@adrinjalali
Member

#9583 solves the last issues here IIUC.
