Precision Recall and F-score: behavior when all negative #14876
(Having some expertise in NER evaluation, the thought of using scikit-learn's PRF implementation for NER seems a bit awkward. I'm also not entirely convinced that an average of scores per document is going to be very informative if some documents have no entities... Rather, you might want the average recall over only those documents with entities, and the average precision over only those documents that have predictions...)

Yes, in principle I could consider supporting returning 1... not that I did back when we settled on the current behaviour in ?2013. Not convinced we need a "raise an error" option (all warnings can be converted to errors using the standard warnings filters). Please specify:
|
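As a side note on the "raise an error" point above: here is a minimal sketch (mine, not from the thread) of escalating the existing `UndefinedMetricWarning` into an exception with Python's standard warnings filters, which is the mechanism that comment alludes to.

```python
import warnings

from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 0]
y_pred = [0, 0, 0, 0]  # no positive predictions, so precision is undefined

# Default behaviour: returns 0.0 and emits an UndefinedMetricWarning
print(precision_score(y_true, y_pred))  # 0.0

# Escalate that warning class to an exception
warnings.simplefilter("error", UndefinedMetricWarning)
try:
    precision_score(y_true, y_pred)
except UndefinedMetricWarning as exc:
    print("raised:", exc)
```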
Copy-pasting my comment from #12312: |
I agree that 1 might be a more reasonable choice, but we use 0 everywhere and I'm not sure whether it's worthwhile to add a parameter. |
1 is sometimes a reasonable choice. We avoided it in part because it's generally safer to report a lower bound on a metric than an upper bound :) Reporting 0 with a warning raises awareness.
|
About the NER choice, thanks for your comment. I know it's not standard usage, but my intuition goes like this: I want to reward the system when it returns all negatives and there aren't any, but I don't want to use accuracy because, in the documents with entities, I don't want to boost the score for that document based on detecting a lot of TNs. Does it make more sense written like this? Plus, in my scenario most documents have 1 entity max, and I don't want the length of the entity to put more weight on one document over another.

Let's do the F-score case, but I can replicate it for prec/rec if not clear. Consider it as: TP / (TP + 0.5*(FP + FN)).

What would the parameter be called?
Then any of the 6 LGTM, but if I have to choose one I would say...

- What would you return if y_true is all negative, but positives are in y_pred?
- What would you return if y_pred is all negative, but positives are in y_true?
- What would you return if y_true and y_pred are all negative?

|
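For concreteness, a small check (mine, not part of the thread) that TP / (TP + 0.5*(FP + FN)) is the same quantity as the usual F1 = 2*TP / (2*TP + FP + FN), and what sklearn currently does in the all-negative case:

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, fp, fn = 2, 1, 1  # counts for the toy example above

print(tp / (tp + 0.5 * (fp + fn)))  # 0.666...
print(f1_score(y_true, y_pred))     # 0.666..., same value

# TP = FP = FN = 0: numerator and denominator both vanish, which is the
# ill-defined case being debated (currently 0.0 plus an UndefinedMetricWarning).
print(f1_score([0, 0, 0], [0, 0, 0]))
```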
Can you please frame it in terms of precision and recall, in the case of `return_one_if_no_positives=True`? F1 is obvious given P & R.

Maybe the parameter should just be called `zero_division` and we can set it to 0 or 1.
|
Sure:

Precision = TP / (TP + FP):
Recall = TP / (TP + FN):

About the naming: for me it's less self-explanatory, but shorter, and the value of the param is clearer, so it's fine by me. |
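To make the proposal concrete, here is my own minimal sketch (illustrative helper names, not scikit-learn's API) of how a zero_division value would be applied when a denominator is zero:

```python
# Hypothetical helpers: return the metric when it is defined, otherwise the
# user-chosen zero_division value.
def precision(tp, fp, zero_division=0):
    return tp / (tp + fp) if (tp + fp) > 0 else zero_division

def recall(tp, fn, zero_division=0):
    return tp / (tp + fn) if (tp + fn) > 0 else zero_division

# y_true and y_pred both all negative: TP = FP = FN = 0
print(precision(0, 0, zero_division=1), recall(0, 0, zero_division=1))  # 1 1

# y_true all negative but positives predicted: FP > 0, so precision is a real
# 0.0, while recall still falls back to zero_division
print(precision(0, 3, zero_division=1), recall(0, 0, zero_division=1))  # 0.0 1
```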
"the model is doing wrong here"... but not on that metric, only on the
complementary metric..? I'd go for 1 in the "open to change" cases :)
|
Yeah, that's OK. So I think the scope is defined, unless there's something else. This would change only... Shall I start to work on this and open a PR when I have it? |
average_precision_score works differently, and for good reason. If you'd like to open a pull request we will review it...
|
No, see `sklearn/metrics/ranking.py` (line 568 and line 478 at 96bfae6).
|
could you please provide more details? @jnothman |
@marctorrellas I'm unable to understand your solution. |
I agreed to go with a 1 for the "open to change" cases :) |
I started to do this and found a small problem for which I see two options, so I prefer to ask. The problem comes with the warning message, which goes along the lines of: "Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples". The problem is that when computing precision, I don't know what value I'm going to set for the F-score, because it depends on recall. Therefore, I see two options (but enlighten me if you find a better one):
I prefer option 1, but I understand some people won't like having an ambiguous message. What do you reckon? |
How about: if zero_division != 'warn', we give no warning. But we set 'warn' by default and say "spam is ill-defined and being set to 0. Use the zero_division parameter to control this behavior".
|
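Assuming a scikit-learn release where the zero_division parameter from this discussion is available (0.22+), the behaviour sketched above can be illustrated like this: the default warns and returns 0, while an explicit value silences the warning.

```python
import warnings

from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_score

y_true = [0, 0, 0]
y_pred = [0, 0, 0]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default = precision_score(y_true, y_pred)                    # warns, 0.0
    explicit = precision_score(y_true, y_pred, zero_division=1)  # silent, 1.0

undefined = [w for w in caught if issubclass(w.category, UndefinedMetricWarning)]
print(default, explicit, len(undefined))  # 0.0 1.0 1
```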
I updated the PR description so that we have all cases clear. In the development of this, I've found that the F-score can fail if lists are passed instead of np.arrays. Do you want me to fix this as part of this PR? @jnothman @qinhanmin2014 Example:
|
No, see #14865.
|
I think at one point (before 2013?) the behaviour was somewhat inconsistent, or there was debate about what the right behaviour should be. We settled on a warning strategy, and a conservative approach to awarding score: don't award perfect precision to a system just because it gave no positive labels for that class/sample. |
Uhh,... it seems I'd forgotten to send that reply, @qinhanmin2014 |
any updates on this? |
the PR #14900 was merged |
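For reference, after that change (released in scikit-learn 0.22) the behaviour discussed in this issue is controlled by the `zero_division` parameter; a quick sketch:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0]
y_pred = [0, 0, 0]  # TP = FP = FN = 0 for the positive class

# Conservative behaviour (also what the default "warn" returns, with a warning)
print(precision_recall_fscore_support(y_true, y_pred, average="binary",
                                      zero_division=0))  # (0.0, 0.0, 0.0, None)

# The behaviour requested in this issue
print(precision_recall_fscore_support(y_true, y_pred, average="binary",
                                      zero_division=1))  # (1.0, 1.0, 1.0, None)
```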
I've seen these already closed issues (and probably there are more):
#13453
#13143
about the behavior of prec/rec/F1 when both predictions and ground truth are all negatives. See this thread for some discussion of the topic:
https://stats.stackexchange.com/questions/8025/what-are-correct-values-for-precision-and-recall-when-the-denominators-equal-0
My opinion is that sklearn should be flexible about this. I agree with having a default behavior of returning 0 + warning, but I would like to have the option to say: For these special cases, if the true positives, false positives and false negatives are all 0, the precision, recall and F1-measure are 1.
I use this in my day-to-day work, and at the moment I have a wrapper around these scorers to force this. It's not exactly the same scenario, but to simplify, assume you're doing named entity recognition (with one entity to make it easy) and want to apply the F1-score to each document and then average them. If a document has no entities and your model predicts no entities, you would like a perfect score. Unfortunately, the current sklearn scores don't allow you to do that.
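A rough sketch of such a wrapper (my own code, assuming binary 0/1 labels; not the author's actual implementation):

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_all_negative_is_perfect(y_true, y_pred, **kwargs):
    """Hypothetical wrapper: perfect score when no positives exist or are predicted."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if not y_true.any() and not y_pred.any():
        return 1.0  # no entities expected, none predicted: reward the model
    return f1_score(y_true, y_pred, **kwargs)

# Per-document scores, averaged afterwards (the NER-style use case above)
docs = [([0, 0, 0], [0, 0, 0]),   # document with no entities, none predicted
        ([1, 0, 1], [1, 0, 0])]   # document with entities, partially found
scores = [f1_all_negative_is_perfect(t, p) for t, p in docs]
print(np.mean(scores))  # (1.0 + 0.666...) / 2 ≈ 0.83
```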
This is a proposal, and I can do the PR if people here find it useful. My idea is to add a parameter to these scorers so that the default is to raise a warning (and return 0), and the user can set it to new modes:
a) raise an error
b) warning + return 1
Please let me know your thoughts