[MRG+1] Bugfix for precision_recall_curve when all labels are negative #8280
Conversation
Force-pushed from b39523f to 92f129d
assert_raises(Exception, average_precision_score, y_true, y_score)
# y_true = [0, 0]
# y_score = [0.25, 0.75]
# assert_raises(Exception, precision_recall_curve, y_true, y_score)
This is the behaviour change: it used to warn because of a division by zero, which was turned into an error by the np.seterr('raise') context manager. Remove the np.seterr('raise') and make sure that you test the result of precision_recall_curve. You don't need to add more tests for the edge cases since they were already tested; you just need to change what we expect from the results.
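A rough sketch of what the updated toydata check might look like; this is an illustration, not the exact test that ended up in the PR, and it only asserts the property that matters for this change (no exception and no NaN in the recall):

import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0])
y_score = np.array([0.25, 0.75])
# Previously this case raised under np.seterr('raise') because recall was
# computed as tps / tps[-1] with tps[-1] == 0; with the fix it returns
# finite values instead.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
assert not np.isnan(recall).any()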
# y_score = np.array([[0, 1], [0, 1]])
# assert_raises(Exception, average_precision_score, y_true, y_score,
#               average="macro")
# assert_raises(Exception, average_precision_score, y_true, y_score,
Hmm, I have to say I am not sure what should happen in the multi-label case.
@@ -978,3 +979,27 @@ def test_ranking_loss_ties_handling():
    assert_almost_equal(label_ranking_loss([[1, 0, 0]], [[0.25, 0.5, 0.5]]), 1)
    assert_almost_equal(label_ranking_loss([[1, 0, 1]], [[0.25, 0.5, 0.5]]), 1)
    assert_almost_equal(label_ranking_loss([[1, 1, 0]], [[0.25, 0.5, 0.5]]), 1)

def test_precision_recall_curve_all_negatives():
You can remove this test
    assert_not_equal(recall[0], np.nan)

def test_precision_recall_curve_all_positives():
You can remove this test
@lesteve I've updated the tests. Will squash this PR once it gets the green light.
Force-pushed from f06b98c to 50fbd04
From the Travis logs, it seems the tests all pass but something else is failing, which I can't seem to understand. Can someone help with this?
You'll get help, but we're moving through the issues quite slowly. Thanks for your patience.
The Travis failure is a current bug in Travis, which we are waiting on.
I restarted the Travis build. @varunagrawal note you can force a rebuild of all the CIs by pushing a new commit in this branch. My favourite way to do that is …
The Travis flake8 error is genuine. Setting up on-the-fly flake8 checks inside your favourite editor is a great investment to avoid this kind of thing.
Force-pushed from e9b21db to 24e9a11
Codecov Report

@@            Coverage Diff            @@
##           master    #8280     +/-  ##
=========================================
+ Coverage   94.73%   94.73%   +<.01%
=========================================
  Files         342      342
  Lines       60674    60675       +1
=========================================
+ Hits        57482    57483       +1
  Misses       3192     3192

Continue to review full report at Codecov.
@lesteve gotcha! Fixed them and submitted the changes. Thank you for the review. 😃
I might note that in … Otherwise LGTM
@jnothman so as per this answer on StackOverflow (http://stats.stackexchange.com/questions/1773/what-are-correct-values-for-precision-and-recall-in-edge-cases), this does seem to be the correct approach to take. I had the same doubts as you, so I went and looked up half a dozen resources and worked out the arguments in my head before posting an issue about this. Thanks for your review.
And thanks for the research.
The more I think about this, the less sure I am about it. It would be great to find a more authoritative source clearly specifying what the recall value should be when TP + FN == 0; maybe NaN is actually the best we can do ...
@lesteve we've had this discussion with respect to just measuring P and R years ago. This is different because we want a limit. We try to avoid nan because we average these things.
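A tiny illustration (not from the PR) of the averaging concern: once a single per-label or per-fold recall is NaN, a plain mean is poisoned, which is why a well-defined limit value is preferred over NaN.

import numpy as np

per_label_recall = np.array([1.0, 0.8, np.nan])  # one label has TP + FN == 0
print(np.mean(per_label_recall))     # nan: the undefined entry poisons the average
print(np.nanmean(per_label_recall))  # 0.9, but only if callers remember to use nanmean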
@jnothman @lesteve sorry for being AWOL. Paper deadlines have not been kind. From an authoritative standpoint, I believe the original PASCAL paper is our best bet.
Haven't had much time for reviewing, and I expect little availability for the next couple of weeks. Please be patient, and ping again if necessary.
@lesteve, as opposed to in …
Force-pushed from 24e9a11 to b0b9bca
Potentially yes, though hopefully for real-world data the changes are less significant. I don't think a bug in software that you used is an ethics violation. Relative statements are likely to also hold with the fix; absolute statements depend a lot on what you compare against. Some people use "interpolated mAP", which is even more optimistic than our previous implementation.
The changes are severe in my case. My model went from being state-of-the-art to worse than the current state of the art by a significant margin. If I hope to publish these results, I'll have to refute the claims made in my citations, and that's going to be another mess altogether.
Does the current implementation follow the Pascal VOC formulation like the previous one, or is it based more on the information retrieval side of things?
@varunagrawal I'm sorry the bug is creating so much trouble for you. Again, are you comparing against a number in the literature? Neither the previous implementation nor the current one is the PascalVOC one, which implements something that is not standard in the literature. The PascalVOC measure is much more optimistic than any other measure.
We discussed implementing the PascalVOC one, but the PascalVOC paper disagrees with the code in the submission tools (the paper quotes an 11-point interpolation, the MATLAB code does general interpolation). Neither version is found anywhere else in the literature as far as I can tell.
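For readers unfamiliar with the distinction, here is a hypothetical sketch of the two interpolation schemes mentioned above, written against precision/recall arrays such as those returned by precision_recall_curve; the function names are made up for illustration and this is neither the scikit-learn nor the PascalVOC implementation.

import numpy as np

def interpolated_precision(precision, recall, r):
    # Interpolated precision at recall level r: the best precision achieved
    # at any recall >= r (0.0 if that recall level is never reached).
    mask = recall >= r
    return precision[mask].max() if mask.any() else 0.0

def eleven_point_ap(precision, recall):
    # VOC-2007-paper style: average the interpolated precision over the
    # 11 recall levels 0.0, 0.1, ..., 1.0.
    return np.mean([interpolated_precision(precision, recall, r)
                    for r in np.linspace(0.0, 1.0, 11)])

General interpolation instead evaluates the interpolated precision at every achieved recall level rather than at 11 fixed points, which is what the MATLAB submission code does according to the comment above.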
check out #4577
I think we might be open to adding the VOC variant so that people can more easily compare to the computer vision literature.
@amueller thank you for your help on this. I was just about to mention adding a separate PascalVOC version, since I myself work in computer vision and all the papers I have referenced use that form of mAP calculation, thus making it standard from a peer review perspective. Thank you for the link to the argument showing PascalVOC got it wrong. I'll be sure to inform my advisor and see what we can do. However, at this time I'll be glad to help contribute an implementation of the PascalVOC version, allowing a nice separation between the information retrieval community and the computer vision community. :)
@varunagrawal feel free to open a PR to add an argument to the average_precision function. I'd also welcome some additions to the user guide explaining the differences. And for comparing against other computer vision papers, you should not use the version currently implemented in sklearn, but the PascalVOC one. So I don't think you're in trouble.
Yup, the moment you mentioned PascalVOC and the IR definitions being different, I let out a sigh of relief.
@amueller @GaelVaroquaux can you please provide more information on what is going on in this commit? What I don't understand is that there seems to have been some effort in parameterizing the interpolation type being used, but there are additional commits on top of those which remove those changes.
That commit is the (poorly-titled) result of #9017. @vene and @GaelVaroquaux seem to have concluded in person that there were problems in providing averaged eleven-point precision.
Alright, so it seems that I can update the …
@amueller @qinhanmin2014 I know it's been a while since we touched upon this PR. I am more than happy to finish updating this sometime next week if we have a consensus on what is the right thing to do here.
Do you recall the discussion well enough to summarise it?
The whole PascalVOC debate should ideally be settled in a different PR. I can open an issue later if that works.
Reference Issue
Fixes #8245
What does this implement/fix? Explain your changes.
When all the y_true labels are negative, precision_recall_curve returns nan because recall is set to nan instead of 1. This is caused by the direct division of the tps vector by tps[-1], which happens to be 0. This fix checks whether tps[-1] is 0; if so, it sets the recall to 1 directly, since there are no True Positives or False Negatives; otherwise recall is calculated as normal.
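A minimal sketch of the guard described here, assuming a small hypothetical helper that derives the recall curve from the cumulative true-positive counts tps; it illustrates the idea rather than reproducing the exact scikit-learn diff.

import numpy as np

def recall_from_tps(tps):
    tps = np.asarray(tps, dtype=float)
    if tps[-1] == 0:
        # No positive samples in y_true, so TP + FN == 0: define recall as 1
        # everywhere instead of dividing by zero and producing NaN.
        return np.ones_like(tps)
    return tps / tps[-1]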
Any other comments?
I had to update test_precision_recall_curve_toydata since this test was expecting the TrueDivide exception to be raised, which is no longer the case as a result of this fix. I added 2 test cases: one to check when all truth labels are negative and the other to check when all truth labels are positive, to ensure the precision calculation is accurate.
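As a complement to the all-negative sketch earlier on this page, a hedged example of the all-positive case mentioned here; only properties that follow directly from the definitions (no negatives means no false positives, hence precision of 1 at every threshold) are asserted.

import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve([1, 1], [0.25, 0.75])
assert np.all(precision == 1.0)      # no false positives are possible
assert not np.isnan(recall).any()    # recall is well defined here (tps[-1] > 0)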