[WIP] Eleven point average precision by GaelVaroquaux · Pull Request #9091 · scikit-learn/scikit-learn

Closed

Conversation

GaelVaroquaux
Member

Continuation of #9017 (itself a continuation of #7356).

Adds an eleven-point interpolation method for average precision, as described in the IR book.
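
For context, a minimal sketch of the eleven-point definition from the IR book (the function name is illustrative; this is not the code in this PR): the interpolated precision p_interp(r) = max over r' >= r of p(r') is sampled at recalls 0.0, 0.1, ..., 1.0 and averaged.

import numpy as np
from sklearn.metrics import precision_recall_curve

def eleven_point_average_precision(y_true, y_scores):
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    # Exact grid of recall levels (avoids the np.arange float issue
    # discussed later in this thread).
    levels = np.arange(11) / 10.0  # 0.0, 0.1, ..., 1.0
    # Interpolated precision at level r: best precision at any recall >= r.
    p_interp = [precision[recall >= r].max() if (recall >= r).any() else 0.0
                for r in levels]
    return np.mean(p_interp)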

@GaelVaroquaux
Member Author

I will not finish this PR, as I have no need for the eleven-point method. I am just opening the PR so that the code doesn't get lost. @ndingwall, if you're interested, please do take over.

@vene
Member
vene commented Jun 9, 2017

I wrote a simpler reference implementation which does not always agree with the one here, see this gist.

@varunagrawal
Contributor

@amueller @jnothman @ndingwall can we please make some progress on this PR? This issue has gotten a bit out of hand at this point.

@jnothman
Member
jnothman commented Feb 25, 2018 via email

@varunagrawal
Contributor

@jnothman the gist seems to be a to-the-point implementation of the AP metric defined in the Pascal VOC paper. I'd like to hear comments from @ndingwall.

@ndingwall
Contributor
ndingwall commented Feb 26, 2018

I took a look at @vene's gist and the discrepancies seem to come from weird behavior of np.arange. I can reproduce @vene's results when I run the gist locally, but if I replace the list on line 13 with np.arange(0, 1.1, 0.1), then the differences are all zero.

To check what was going on, I printed out list(np.arange(0, 1.1, 0.1)) and got [0.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6000000000000001, 0.7000000000000001, 0.8, 0.9, 1.0]. To double-check that this was causing the problem, I verified that e.g. 0.3 appears as a recall value for one of the examples with nonzero differences, but not for an example with a difference of zero, and indeed this was the case.

Fixing that is probably out of scope for this PR!
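
The behavior @ndingwall describes is easy to reproduce (a quick illustration, not part of the original thread):

import numpy as np

# Building the grid from a float step of 0.1 accumulates rounding error:
print(np.arange(0, 1.1, 0.1).tolist())
# [0.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5,
#  0.6000000000000001, 0.7000000000000001, 0.8, 0.9, 1.0]

# Dividing integers yields the correctly rounded values instead:
print((np.arange(11) / 10.0).tolist())
# [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]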

@ndingwall
Contributor

I don't like interpolated average precision because it seems like cheating to me (as discussed in #7356); I only included it in the previous incarnation of this PR since it was an easy addition to a function I was already rewriting and might have satisfied #4577.

But as I noted in #7356 (comment), the Pascal VOC evaluation code doesn't actually use 11-point AP, and it's not clear whether we should be implementing something that agrees with their code, or something that agrees with their definition (on page 11 of the challenge documentation), or both. There's discussion of that in #4577.

In any case, given my view that p_interp is cheating, I'm not enthusiastic about implementing it, so I'd be happy to close this PR. If there's strong consensus that sklearn should contain an implementation, I'd be in favour of a separate function that makes explicit that it's intended to compute the two versions of the Pascal VOC evaluation. Then nobody would risk comparing AP to interpolated AP.

@varunagrawal
Contributor
varunagrawal commented Feb 26, 2018

@ndingwall speaking on behalf of the Computer Vision community: the VOC AP is still a widely used metric for reporting average precision. Here's the kicker: I took a look at Ross Girshick's AP code, and the comments state that the interpolated version is the correct one, not the 11-point version. You can see it here.
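
For reference, the interpolated version in that code computes the area under the monotone precision envelope. A sketch in that style (modeled on the widely circulated voc_ap routine; the function name is mine, and this is not the linked code verbatim):

import numpy as np

def voc_style_interpolated_ap(recall, precision):
    # Assumes recall is sorted in increasing order.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Sweep right to left so precision becomes the non-increasing envelope
    # p_interp(r) = max over r' >= r of p(r').
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Integrate the envelope: precision times recall step where recall moves.
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])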

That being said, I think it would be immensely useful for scikit-learn to implement a Pascal VOC Average Precision metric, since most modern workflows use Python. I agree with your point that having a separate function for VOC-specific Average Precision computation (with clarifying documentation) would be the right thing to do.

I am of course willing to help see this through, both in terms of code as well as management.

@varunagrawal
Contributor

There's also an old issue from @karpathy dealing with this. I guess this would help corroborate the need for the VOC AP metric in sklearn.

@ndingwall
Contributor
ndingwall commented Feb 26, 2018

@varunagrawal that sounds good. I have implementations of both 11-point interpolation and the interpolation you linked to, if you want them. 11-point is in the existing PR, and the other one is just something like np.mean([precision[i:].max() for i in range(len(precision))]) (operating on precision after we do precision = precision[-2::-1]). There are likely more efficient implementations, though, since this does a lot of redundant searching for the max over very similar sublists.
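
Assembled into a runnable form (a sketch built from the snippet above; the function name is mine):

import numpy as np
from sklearn.metrics import precision_recall_curve

def interpolated_ap_quadratic(y_true, y_scores):
    precision, _, _ = precision_recall_curve(y_true, y_scores)
    # Drop the final precision = 1 point (at recall = 0) and reverse, so
    # the array is ordered by increasing recall.
    precision = precision[-2::-1]
    # Interpolated precision at each point: best precision at this or any
    # higher recall; O(n^2) because of the repeated max over sublists.
    return np.mean([precision[i:].max() for i in range(len(precision))])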

EDIT: here's a 5x faster implementation to compute an interpolated AP:

import numpy as np
from sklearn.metrics import precision_recall_curve

def interpolated_average_precision(y_true, y_scores):
    precision, _, _ = precision_recall_curve(y_true, y_scores)
    # A single forward pass of running maxima builds the interpolated
    # precision envelope, replacing the repeated .max() over sublists.
    p_interp = [0]
    for i in range(len(precision) - 1):
        p_interp.append(max(p_interp[-1], precision[i]))
    return np.mean(p_interp[1:])
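
A quick check with toy data (illustrative values, not from the thread):

y_true = [0, 0, 1, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.9]
print(interpolated_average_precision(y_true, y_scores))  # 0.875 here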

@ndingwall
Contributor

(Although you should hold off until @amueller / @jnothman / @GaelVaroquaux give their thoughts on making a new function).

@jnothman
Member
jnothman commented Feb 26, 2018 via email

@varunagrawal
Contributor

@amueller @GaelVaroquaux ping!

@varunagrawal
Contributor

Any updates on this? It's been a month since the last activity on this issue.

@jnothman
Member
jnothman commented Apr 4, 2018

I don't think you should anticipate @GaelVaroquaux finishing this, but you're welcome to open a PR completing the work and try to persuade reviewers that it's sufficiently useful and well documented, given the discussion here.

@GaelVaroquaux
Member Author
GaelVaroquaux commented Apr 5, 2018 via email

Base automatically changed from master to main January 22, 2021 10:49
@adrinjalali
Member

Seems like we're not really enthusiastic about including this algorithm in sklearn anymore.
