[MRG] Bug fix and new feature: fix implementation of average precision score and add eleven-point interpolated option #7356
Conversation
sklearn/metrics/ranking.py
Outdated
interpolation : string ['linear' (default), 'step']
    Determines the kind of interpolation used when computing the AUC. If there
    are many repeated scores, 'step' is recommended to avoid under- or
    over-estimating the AUC. See `Roam Analytics blog post
This doesn't seem a good source to me. I'm happy with citing the IR book, though.
No problem - I'll remove it and add the IR book to the references.
@amueller Related question: in the references section, is it okay to include the link to my blog post? I'm never sure whether the references are to justify the implementation, or to provide useful guidance for users. I feel like my post might be helpful for the latter but not for the former!
You can also cite the Pascal VOC paper. Can you please also add a reference to the paper I cited earlier about the relation of the ROC curve and the PR curve?

Wait, so currently we are using linear interpolation? That is actually... wrong... see the paper I cited.

Could you please do a comparison of the VOC measure, the measure you implemented, the measure in the IR book and our current measure? Also, can you maybe add an example that compares the different methods? This is pretty subtle. For PR, there is "linear interpolation" (which we currently do and is wrong), "VOC smoothing" which is even worse, "step-wise interpolation" which is a lower bound, and the correct interpolation as explained in the paper, which requires a validation set.

The IR book is also cheating (using the VOC method); the blog post is not cheating as far as I can see.

Oh, the blog post is by you? Then never mind my comment there ;) So interestingly, AP as defined on wikipedia is again different from this. Mean AP is the mean over recall@k for all k. That means some of the sawteeth are skipped over, but not all.
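To make the contrast concrete, here is a minimal sketch (my own, not code from this PR) comparing the trapezoidal "linear" area under a precision-recall curve with the step-wise sum discussed above, using scikit-learn's `precision_recall_curve` on invented toy labels and scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy data, invented for illustration only.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.5, 0.3])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# precision_recall_curve returns the points with recall in decreasing order,
# ending at the artificial point (recall=0, precision=1); reverse so that
# recall is ascending.
precision, recall = precision[::-1], recall[::-1]

# "Linear" interpolation: trapezoidal area under the PR curve.
linear_ap = np.trapz(precision, recall)

# Step-wise interpolation: weight each precision by the increase in recall
# since the previous operating point, i.e. sum_k P(k) * delta_r(k).
step_ap = np.sum(np.diff(recall) * precision[1:])

print(linear_ap, step_ap)  # the two values generally differ
```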
sklearn/metrics/ranking.py
Outdated
@@ -55,6 +57,25 @@ def auc(x, y, reorder=False):
    If True, assume that the curve is ascending in the case of ties, as for
    an ROC curve. If the curve is non-ascending, the result will be wrong.

interpolation : string ['trapezoid' (default), 'step']
linear, not trapezoid ;)
Good catch!
Ok, so after thinking about it for another hour: you are implementing the wikipedia definition. We should not allow "linear" interpolation for ... So I'd add a parameter to ... I wouldn't worry about AUC for now.
Before I get on this flight, I want to clarify your last comment so I can spend some flight time working on it! I haven't read the paper yet, so apologies if some of this would be pre-empted by that.

Interpolation options

Comments:

'None' falls into the same pitfalls as linear interpolation. If we only have two operating points, the arithmetic mean of the precision values is the midpoint of the straight line joining them. That's what I was trying to escape!

This problem doesn't apply to the 11PIAP because the recall values are specified, so although we're taking an arithmetic mean, we're at least comparing oranges and oranges. But 11PIAP does introduce a different problem. Consider a model that assigns scores which permit recall values of 0.09, 0.19, 0.29, etc. This model is likely to appear to be much worse than one whose recall values are 0.11, 0.21, 0.31, etc., even if the first one has much better precision at those operating points. We could imagine a situation where including examples at random would actually improve the 11PIAP by bringing the recall above the arbitrary thresholds.

The step interpolation I proposed is really just an extension of the 11PIAP to arbitrarily many points. It doesn't introduce the bias that linear interpolation does, and it allows each operating point to fully contribute whatever its recall value is. So I'd still advocate for including that as an option!
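For reference, a rough sketch (mine, not the PR's implementation) of the eleven-point interpolated AP being discussed: for each recall level r in {0, 0.1, ..., 1.0}, take the best precision achievable at recall >= r and average the eleven values.

```python
import numpy as np


def eleven_point_interpolated_ap(precision, recall):
    """Eleven-point interpolated AP from arrays of (precision, recall) points,
    e.g. as returned by precision_recall_curve."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    interpolated = []
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        # Best precision among operating points reaching recall >= r;
        # 0 if no operating point reaches this recall level.
        interpolated.append(precision[mask].max() if mask.any() else 0.0)
    return np.mean(interpolated)
```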
By "none" I meant what you did, i.e. step-wise interpolation to the left. it's not really interpolation, though. I hope I'll have some plots later today to illustrate. |
Oh, that makes sense. I call it 'interpolated' because I'm thinking of it from a graphical perspective, but I see that the wiki article comes to the same calculation without explicitly considering the curve. I'll recode it so that the wikipedia definition ($\operatorname{AveP} = \sum_{k=1}^n P(k) \Delta r(k)$) is the default and I'll add an option to use Presumably I'll need to change some tests since we'll be changing the default behavior. Is there a best-practice way to do that? |
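To make that definition concrete, a toy worked example (numbers invented for illustration, not taken from the PR): with operating points $(r, p) = (0.5, 1.0)$ and $(1.0, 0.5)$, plus the usual extra point at $(0, 1)$, the step-wise sum gives $\operatorname{AveP} = 0.5 \cdot 1.0 + 0.5 \cdot 0.5 = 0.75$, whereas linearly interpolating between the same points gives a trapezoidal area of $0.5 \cdot 1.0 + 0.5 \cdot \tfrac{1.0 + 0.5}{2} = 0.875$.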
Check which tests fail and fix them. Add a test that checks that the current behavior is correct, unlike the previous one, possibly using one of the reported bugs.

Please also check #4936 to see if you can reuse any of that.

Sorry, accidental close.
Please add a non-regression test. Ideally it would be great if you could have a test against the output of the Pascal VOC code.
sklearn/metrics/ranking.py
Outdated
    Uses a step function to interpolate between operating points.
interpolation : None (default), or string ['eleven_point']

``None``:
Can you check how this renders / whether sphinx spits out warnings?
sklearn/metrics/ranking.py
Outdated
    Linearly interpolates between operating points.
``'step'``:
    Uses a step function to interpolate between operating points.
interpolation : None (default), or string ['eleven_point']
why a list?
I was just copying the style from other places that include acceptable strings in a list. But they had multiple options, so I guess that made more sense. Should it read `interpolation : None (default), or string 'eleven_point'` or `interpolation : None (default), or string ['eleven_point', ]`?
the first, I think.
Done. See the screenshot below.
Returns
-------
average_precision : float

References
----------
.. [1] `Wikipedia entry for the Average precision
       <https://en.wikipedia.org/wiki/Average_precision>`_
       <http://en.wikipedia.org/wiki/Average_precision>`_
.. [2] `Stanford Information Retrieval book
Maybe add the Pascal VOC paper?
This one? Page 11 describes the metric. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf
yes
Done.
sklearn/metrics/ranking.py
Outdated
``'eleven_point'``:
    For each of the recall values, r in {0, 0.1, 0.2, ..., 1.0},
    compute the arithmetic mean of the first precision value with a
    corresponding recall ≥ r. See the `Stanford Information Retrieval
The ≥ symbol is probably not a good idea here (you could declare an encoding for the file, but I think just writing >= is better).
Done.
@amueller is there a python implementation of Pascal VOC? The Matlab implementation is in this comment, but I can't find a Python version. I'd write one, but it's likely to be identical to the implementation here, so that's probably not helpful for testing purposes!

There's no python version. Would you mind running the matlab version? There could be subtle differences.

I don't have either matlab or octave. If you/someone else can run it and paste the ...

@ndingwall I'm back! Tests are failing. For testing the 11-point average, I would probably like to test against actual outcomes of the Pascal VOC code, since that is de-facto the definition of this measure (because defining scientific evaluation measures with matlab code is such a good idea!).
@@ -6,6 +6,34 @@
Release history
===============

Version 0.18.2
This version will not happen; we'll do 0.19 soonish.
@amueller The CircleCI output seems to stop with this error, so I'm not seeing the test outputs, but I had a problem before where my local results didn't match what CircleCI was reporting.
Do you have an example? I don't have Matlab so I can't generate any.

If you merge with current master, you'll not only have an opportunity to fix the current conflicts, but also find that there is a section for 0.19.
I've not reviewed it all, but I don't see how we can make a fundamental change to basically every evaluation of a metric without having a deprecation cycle or something even more drastic like giving the new implementations new function names. People have, no doubt, reported experimental results with the incumbent implementation and will find their measures suddenly and inexplicably changed on update. I don't think that's okay.
I'm pretty sure I recall an earlier version of this maintaining current behaviour, and I think we need to keep that, with a warning, for some time; or else warn loudly that it has changed but instruct the user how to silence the warning.
/viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf>`_. In the example
below, the eleven precision values are indicated with an arrow pointing to
the best precision possible while meeting or exceeding the desired recall.
Note that it's possible that the same operating point might correspond to
This is in contradiction to the definition of operating point above. Do you simply mean that precision may be identical for multiple recall values?
No, it's that there's no requirement that the eleven operating points chosen must be distinct: e.g. the points corresponding to recalls of 0.2, 0.3 and 0.4 might all be the same operating point at e.g. (0.45, 0.6) if 0.6 is the best precision you can get with a recall of at least 0.2.
- :func:`metrics.ranking.average_precision_score` no longer linearly
  interpolates between operating points, and instead weights precisions
  by the change in recall since the last operating point, as per the
  `Wikipedia entry <http://en.wikipedia.org/wiki/Average_precision>`_.
I think you need to cite a more stable reference; at a minimum, a versioned Wikipedia page.
👍🏻
Can you maybe fix the conflicts?
Looks good apart from the minor nitpick, I think. Trying to get @agramfort to dig up his matlab.
0.79...
0.83...

>>> yt = np.array([0, 0, 1, 1])
You could also just use the `y_true` and `y_score` variables defined above, instead of redefining them here?
I didn't realize Pascal VOC uses the standard AP, not the 11-point version. Is there a reference implementation for the 11-point version somewhere?
I did:
then in matlab using VOC code:
returns
In the interest of getting this merged while the heat of the sprint lasts, I am taking over this PR and finishing up the last details.

@amueller The Pascal VOC documentation is very clear that they're using 11-point AP in their evaluation (top right of page 11). It's not dated, so this might have changed; do you have the actual Matlab code used in their evaluation? Will you or @GaelVaroquaux respond to @jnothman's comment above?

I want to heavily rewrite the example. It's gotten way too complicated for demoing the average precision. I want to try to break it down more (and maybe do slightly less pretty plots).

@ndingwall http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCdevkit_18-May-2011.tar is the matlab code... hm
@amueller, it looks like they're using interpolated average precision, just not 11-point interpolated. This loop propagates the best precision backwards iteratively from the highest recall value.
But either way, this contradicts the Pascal VOC documentation I've linked to in the docstrings! Do you know if this was updated somewhere?
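For readers without Matlab, here is a rough Python sketch (assumptions mine, not the devkit's code) of the kind of interpolation that loop performs: each precision is replaced by the best precision at an equal-or-higher recall, i.e. a running maximum propagated backwards from the highest recall, and the area is then taken under that smoothed curve.

```python
import numpy as np


def voc_style_interpolated_ap(precision, recall):
    # Sort the operating points by ascending recall.
    order = np.argsort(recall)
    precision = np.asarray(precision, dtype=float)[order]
    recall = np.asarray(recall, dtype=float)[order]
    # Propagate the best precision backwards from the highest recall:
    # interpolated[i] = max(precision[i:]).
    interpolated = np.maximum.accumulate(precision[::-1])[::-1]
    # Weight each interpolated precision by the recall gained at that point.
    return np.sum(np.diff(recall, prepend=0.0) * interpolated)
```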
@GaelVaroquaux: agreed. It's hard to find the balance between providing examples of all functionality and avoiding unnecessary complication. You're better placed than me to strike that balance! In any case, it looks like we might be able to abandon 11-point interpolated AP, although we might have to replace it with interpolated AP, which suffers the same weird peeking-at-the-test-set-performance problem that 11-point IAP does.

So, it seems that on different datasets we do not reproduce the Pascal VOC measure well. I am not sure why.

@GaelVaroquaux, that's probably because I based the code on this documentation, which doesn't correspond to their evaluation code. This calls for 11-point interpolated AP, but the code does straight interpolated AP (no fixing of recall values). I could probably amend the code to reflect their code, but it'd be good to also update the link to the documentation. I just tried searching for an updated version of the documentation, but the host.robots.ox.ac.uk website seems to be down at the moment...

@ndingwall: do you mean that it should match our `interpolation=None`? That's not what I am observing :$

No, `interpolation=None` is pessimistic, while I think theirs is optimistic (aka cheating).

@GaelVaroquaux: as @amueller said, theirs does use interpolation, just not the 11-point variety. We don't have an implementation that matches their evaluation code, but it should be easy to write.

@ndingwall: understood. I confirm, looking at their code. It doesn't seem useful to try to replicate it.

@GaelVaroquaux, the interpolated version is requested in #4577. I don't mind removing it (and the 11-point version) because, like @amueller, I consider them both to be cheating! But we could argue that the code should reflect what the community uses rather than what we think is best...

Closing in favor of #9091.

@GaelVaroquaux you have closed this issue and stated in #9091 that you won't be finishing that PR. Can we please re-open and resolve this issue?
@vene, it seems you and @GaelVaroquaux discussed 11-point average precision at a sprint last year and thought it was problematic. Could you please enlighten us?
Collation of related issues in order to improve the PR.
else:
    raise ValueError("interpolation has to be one of "
                     "(None, 'eleven_point').")
- We also need an option for the VOC variant. @ndingwall you mentioned that you could write it easily. Can you please add it in?
- It would be nice to abstract the `_binary_*_average_precision` functions into a variable and have a single call to `_average_binary_score`. It would help make the code easier to read (a rough sketch of what this could look like is below).
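Regarding the second point, a hedged sketch (not the PR's code) of what that refactor could look like, assuming the private helper `_average_binary_score` that lived in `sklearn.metrics.base` at the time of this PR; the two `_binary_*` helpers are simplified stand-ins for the functions defined in the PR.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.metrics.base import _average_binary_score


def _binary_uninterpolated_average_precision(y_true, y_score, sample_weight=None):
    precision, recall, _ = precision_recall_curve(
        y_true, y_score, sample_weight=sample_weight)
    # Step-wise AP: sum_k P(k) * delta_recall(k); recall is returned in
    # decreasing order, hence the minus sign.
    return -np.sum(np.diff(recall) * np.asarray(precision)[:-1])


def _binary_eleven_point_average_precision(y_true, y_score, sample_weight=None):
    precision, recall, _ = precision_recall_curve(
        y_true, y_score, sample_weight=sample_weight)
    # Best precision at recall >= r, averaged over r in {0, 0.1, ..., 1.0}
    # (see the fuller sketch earlier in the thread).
    return np.mean([precision[recall >= r].max()
                    for r in np.linspace(0, 1, 11)])


def average_precision_score_sketch(y_true, y_score, average="macro",
                                   sample_weight=None, interpolation=None):
    # Pick the per-threshold binary scorer once, then make a single call
    # to _average_binary_score, as suggested above.
    if interpolation is None:
        binary_metric = _binary_uninterpolated_average_precision
    elif interpolation == 'eleven_point':
        binary_metric = _binary_eleven_point_average_precision
    else:
        raise ValueError("interpolation has to be one of "
                         "(None, 'eleven_point').")
    return _average_binary_score(binary_metric, y_true, y_score,
                                 average, sample_weight=sample_weight)
```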
# We need the recall values in ascending order, and to ignore the first
# (precision, recall) pair with precision = 1.
precision = precision[-2::-1]
recall = recall[-2::-1]
I think it would make sense to fix line 470 in `precision_recall_curve` instead of doing this. Please refer to #8280.
Reference Issue
Fixes #4577 and #6377

What does this implement/fix? Explain your changes.
This adds an optional `interpolation` parameter to both `auc` and `average_precision_score`. By default, the value is set to `'linear'`, which replicates the existing behavior, but there is also a `'step'` option that implements the strategy described in Stanford's Introduction to Information Retrieval, and as discussed in this blog post. To generalize this, `auc` also offers a parameter to control the direction of the step-wise interpolation; I'm sure someone will find this useful at some point!

Any other comments?
At some point, I'd support switching the standard behavior to step-wise interpolation, but that will make the existing docstring tests fail. I don't want to change them without the support of the community!
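For illustration, a hypothetical usage example of the API this PR proposes; the `interpolation` keyword is what the PR adds and does not exist in released scikit-learn, so this snippet only runs against the PR branch.

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# 'linear' replicates the existing behavior; 'step' uses the step-wise
# interpolation proposed in this PR.
ap_linear = average_precision_score(y_true, y_score, interpolation='linear')
ap_step = average_precision_score(y_true, y_score, interpolation='step')
```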