Precision Recall numbers computed by Scikits are not interpolated (non-standard) #4577
Comments
"Non-standard" might depend on the particular field of application. I think it would be worth having an interpolation mode. Could you also comment on #4223? |
RE "Non-standard" granted. More precisely, the community-agreed-on gold standard at least in the chunk of Machine Learning and Computer Vision that I work in is the Pascal VOC evaluation code. The Pascal VOC code is in Matlab, and works as follows. First, in % compute precision/recall
[so,si]=sort(-out); % out are class confidences
tp=gt(si)>0; % true positives
fp=gt(si)<0; % false posities
fp=cumsum(fp);
tp=cumsum(tp);
rec=tp/sum(gt>0); % recalls
prec=tp./(fp+tp); % precisions and then they call function ap = VOCap(rec,prec)
mrec=[0 ; rec ; 1];
mpre=[0 ; prec ; 0];
for i=numel(mpre)-1:-1:1
mpre(i)=max(mpre(i),mpre(i+1));
end
i=find(mrec(2:end)~=mrec(1:end-1))+1;
ap=sum((mrec(i)-mrec(i-1)).*mpre(i)); In particular note the padding with 0 and 1, and the interpolation (loop with max). The last part integrates the precision over the recalls. |
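For anyone who wants to experiment with this in Python, here is a rough NumPy port of the VOCap routine above; the function name voc_ap is just illustrative, and rec/prec are assumed to come from a cumulative sweep over decreasing confidences as in the Matlab snippet:

```python
import numpy as np

def voc_ap(rec, prec):
    """Pascal VOC-style interpolated average precision (rough port of VOCap.m)."""
    # pad the curve so it starts at recall 0 and ends at recall 1
    mrec = np.concatenate(([0.0], rec, [1.0]))
    mpre = np.concatenate(([0.0], prec, [0.0]))
    # "interpolation": make precision monotonically non-increasing in recall
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # integrate precision over recall, only where recall actually changes
    idx = np.where(mrec[1:] != mrec[:-1])[0] + 1
    return float(np.sum((mrec[idx] - mrec[idx - 1]) * mpre[idx]))
```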
+1 for documenting and/or adding this variant. This is the kind of hidden pitfall that users should be made more aware of. |
+1 that probably makes the numbers in my papers better ^^ |
Are we sure that this is the correct way to interpolate the PR curve? I thought that the interpolation has to be done first in ROC space and then translated into PR space, since there is a one-to-one mapping between those curves. This is highlighted in this paper. |
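For what it's worth, the one-to-one mapping mentioned here only needs the class counts; a minimal sketch (the helper name and arguments are mine, not from any library):

```python
def roc_point_to_pr(tpr, fpr, n_pos, n_neg):
    """Map a ROC-space operating point (TPR, FPR) to PR space for a dataset
    with n_pos positives and n_neg negatives. Precision is undefined when
    nothing is predicted positive; return 1.0 in that corner case here."""
    tp = tpr * n_pos
    fp = fpr * n_neg
    recall = tpr
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    return precision, recall
```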
We could have multiple values for the interpolate parameter; that way we can add newer interpolation techniques? |
Having multiple strategies is probably a good idea. I think VOC and vl_feat are pretty good references, but that is because I'm from the computer vision community. Other communities might have different conventions, and interpolating AUC in the precision/recall space does seem a bit odd. |
[I haven't really put any thought into it though] |
The IR book also seems to interpolate the precision / recall curve. So I think that should be the right thing to do? http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html |
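For concreteness, the interpolation described in that chapter defines interpolated precision at recall level r as the maximum precision observed at any recall >= r, and the 11-point version averages this over r = 0.0, 0.1, ..., 1.0. A minimal sketch (function name mine; rec and prec are assumed to be the raw curve points):

```python
import numpy as np

def eleven_point_interpolated_ap(rec, prec):
    """11-point interpolated average precision in the spirit of the IR book:
    interpolated precision at recall r is the max precision at any recall >= r,
    averaged over the 11 recall levels 0.0, 0.1, ..., 1.0."""
    rec, prec = np.asarray(rec), np.asarray(prec)
    levels = np.linspace(0.0, 1.0, 11)
    interp = [prec[rec >= r].max() if np.any(rec >= r) else 0.0 for r in levels]
    return float(np.mean(interp))
```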
fair enough |
Actually, the Pascal score is just pretty wrong. We can implement it as an option, but we should probably do the achievable interpolation by default, as explained in the paper above. |
As a sidenote, AUC for precision-recall curves is not a good measure of performance anyway, as argued by this NIPS paper from Peter Flach. |
@karpathy do you know of a defence of the smoothing in the IR book and Pascal? It assumes you know which thresholds are good on the test set. You could pick thresholds on the validation set and then apply them to the test set, but picking the thresholds on the test set seems like a pretty bad error. |
So I think we should change the current behavior of average_precision_score. |
FYI, we just came across these issues a month ago and had to read through the implementation to see what was going on. I think it's dangerous for the documentation to reference what is on a Wikipedia page to define something, since Wikipedia can change; looking through the docs, I personally found it very difficult to understand what the function was doing. The formula for the current sklearn implementation should be stated explicitly in the documentation. (Obviously there are many different definitions of these things, so it's not like one correct method can be chosen.) |
Fixed by #9017 |
@GaelVaroquaux I don't think this was fixed by your PR, as it neither implemented the "interpolated" nor the "11 point interpolated" version. |
On the off-chance that @karpathy is reading this: the IR book doesn't use interpolation, from what I understand, and vl_feat uses non-interpolated average precision by default, from the code you linked to. |
> As a sidenote, AUC for precision-recall curves is not a good measure of performance anyway, as argued by this NIPS paper

I would argue that this paper doesn't make its point, as it seems to believe that F_1 or F_beta are useful metrics, whereas in my experience they are not.
|
The nips paper linked above is quite nice and argues again that the VOC measure makes no sense ^^ |
I'm pleased to see that average_precision_score no longer linearly interpolates between operating points, but I found the updated documentation confusing. The referenced Stanford IR book and VOC paper implement interpolated versions that differ from the sklearn implementation. Adding the average precision formula as @brendano suggested or removing these references would make it easier to see what the function computes without looking at the source. |
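For reference, the formula being asked for here (the non-interpolated definition that, as far as I can tell, scikit-learn's average_precision_score now documents) is

$$\mathrm{AP} = \sum_n (R_n - R_{n-1})\, P_n,$$

where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold; there is no smoothing or interpolation of the curve.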
You're right that @brendano's comment about clear documentation still
stands.
I wasn't heavily involved in this change, but I think there was a decision
that the Pascal VOC metric had too many problems to be a default, even
though we may want to provide it (PR welcome).
A PR removing the VOC reference, fixing the Wikipedia link to a particular
revision, and describing the metric both in the docstring and in
doc/modules/model_evaluation.rst would be very welcome.
|
#9583 solves the last issues here IIUC. |
Hi,
Scikit-learn seems to implement Precision-Recall curves (and Average Precision values / AUC under the PR curve) in a non-standard way, without documenting the discrepancy. The standard way of computing Precision-Recall numbers is by interpolating the curve, as described here:
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
the motivation is to smooth out the characteristic saw-tooth shape of the raw curve, so that the reported precision at a given recall level is the best precision achievable at that recall or higher.
This is also what the standard Pascal VOC code does, and they explain it in their writeup:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf
VL_FEAT also has options for interpolation:
http://www.vlfeat.org/matlab/vl_pr.html
and as shown in their code here: https://github.com/vlfeat/vlfeat/blob/edc378a722ea0d79e29f4648a54bb62f32b22568/toolbox/plotop/vl_pr.m
The concern is that people using the scikit-learn version will report LOWER performance numbers than what they might see reported in other papers.
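To make the gap concrete, here is a minimal sketch on made-up toy scores comparing scikit-learn's non-interpolated average precision with a VOC-style interpolated value computed from the same curve (the interpolation step is my own NumPy re-implementation, not part of scikit-learn):

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3])

# scikit-learn's value: step-wise sum, no interpolation
ap_sklearn = average_precision_score(y_true, y_score)

# VOC-style interpolated value from the same operating points
precision, recall, _ = precision_recall_curve(y_true, y_score)
prec, rec = precision[::-1], recall[::-1]       # reorder by increasing recall
prec = np.maximum.accumulate(prec[::-1])[::-1]  # running max from the right (the interpolation)
ap_interp = float(np.sum(np.diff(rec, prepend=0.0) * prec))

print(f"non-interpolated AP (sklearn): {ap_sklearn:.3f}")
print(f"interpolated AP:               {ap_interp:.3f}")
```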