Precision Recall and F-score: behavior when all negative #14876
(Having some expertise in NER evaluation, the thought of using scikit-learn's PRF implementation for NER seems a bit awkward. I'm also not entirely convinced that an average of scores per document is going to be very informative if some documents have no entities... Rather, you might want the average recall over only those documents with entities, and the average precision over only those documents that have predictions...)

Yes, in principle I could consider supporting returning 1... not that I did back when we settled on the current behaviour in ?2013. Not convinced we need a "raise an error" option (all warnings can be converted to errors using the standard warnings filters). Please specify:
|
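As a side note on the "raise an error" point above: here is a minimal sketch (mine, not from the thread) of escalating the existing `UndefinedMetricWarning` into an exception with Python's standard warnings filters, which is the mechanism that comment alludes to.

```python
import warnings

from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 0]
y_pred = [0, 0, 0, 0]  # no positive predictions, so precision is undefined

# Default behaviour: returns 0.0 and emits an UndefinedMetricWarning
print(precision_score(y_true, y_pred))  # 0.0

# Escalate that warning class to an exception
warnings.simplefilter("error", UndefinedMetricWarning)
try:
    precision_score(y_true, y_pred)
except UndefinedMetricWarning as exc:
    print("raised:", exc)
```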
Copy-pasting my comment from #12312: |
I agree that 1 might be a more reasonable choice, but we use 0 everywhere and I'm not sure whether it's worthwhile to add a parameter. |
1 is sometimes a reasonable choice. We avoided it in part because it's generally safer to report a lower bound on a metric than an upper bound :) Reporting 0 with a warning raises awareness.
|
About the NER choice, thanks for your comment. I know it's not standard usage, but my intuition goes like this: I want to reward the system when it returns all negatives and there aren't any, but I don't want to use accuracy because, in the documents with entities, I don't want to boost the score for that document based on detecting a lot of TNs. Does it make more sense written like this? Plus, in my scenario most documents have 1 entity max, and I don't want the length of the entity to put more weight on one document over another.

Let's do the F-score case, but I can replicate it for prec/rec if not clear. Consider it as: TP / (TP + 0.5*(FP + FN)).

What would the parameter be called?
Then any of the 6 LGTM, but if I have to choose one I would say...

- What would you return if y_true is all negative, but positives are in y_pred?
- What would you return if y_pred is all negative, but positives are in y_true?
- What would you return if y_true and y_pred are all negative?

|
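For concreteness, a small check (mine, not part of the thread) that TP / (TP + 0.5*(FP + FN)) is the same quantity as the usual F1 = 2*TP / (2*TP + FP + FN), and what sklearn currently does in the all-negative case:

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, fp, fn = 2, 1, 1  # counts for the toy example above

print(tp / (tp + 0.5 * (fp + fn)))  # 0.666...
print(f1_score(y_true, y_pred))     # 0.666..., same value

# TP = FP = FN = 0: numerator and denominator both vanish, which is the
# ill-defined case being debated (currently 0.0 plus an UndefinedMetricWarning).
print(f1_score([0, 0, 0], [0, 0, 0]))
```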
Can you please frame it in terms of precision and recall, in the case of `return_one_if_no_positives=True`? F1 is obvious given P & R.

Maybe the parameter should just be called `zero_division` and we can set it to 0 or 1.
|
Sure:

Precision = TP / (TP + FP):
Recall = TP / (TP + FN):

About the naming: for me it's less self-explanatory, but shorter, and the value of the param is clearer, so it's fine by me. |
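To make the proposal concrete, here is my own minimal sketch (illustrative helper names, not scikit-learn's API) of how a zero_division value would be applied when a denominator is zero:

```python
# Hypothetical helpers: return the metric when it is defined, otherwise the
# user-chosen zero_division value.
def precision(tp, fp, zero_division=0):
    return tp / (tp + fp) if (tp + fp) > 0 else zero_division

def recall(tp, fn, zero_division=0):
    return tp / (tp + fn) if (tp + fn) > 0 else zero_division

# y_true and y_pred both all negative: TP = FP = FN = 0
print(precision(0, 0, zero_division=1), recall(0, 0, zero_division=1))  # 1 1

# y_true all negative but positives predicted: FP > 0, so precision is a real
# 0.0, while recall still falls back to zero_division
print(precision(0, 3, zero_division=1), recall(0, 0, zero_division=1))  # 0.0 1
```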
"the model is doing wrong here"... but not on that metric, only on the
complementary metric..? I'd go for 1 in the "open to change" cases :)
|
Yeah, that's OK. So I think the scope is defined, unless there's something else. This would change only... Shall I start to work on this and open a PR when I have it? |
average_precision_score works differently, and for good reason. If you'd like to open a pull request we will review it...
|
No, see `sklearn/metrics/ranking.py` (line 568 and line 478 at 96bfae6).
|
could you please provide more details? @jnothman |
@marctorrellas I'm unable to understand your solution. |
I agreed to go with a 1 for the "open to change" cases :) |
I started to do this and found a small problem for which I see two options, so I prefer to ask. The problem comes with the warning message, which goes along the lines of: "Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples". The problem is that when computing precision, I don't know what value I'm going to set for the F-score, because it depends on recall. Therefore, I see two options (but enlighten me if you find a better one):
I prefer option 1, but I understand some people won't like having an ambiguous message. What do you reckon? |
How about: if zero_division != 'warn', we give no warning. But we set 'warn' by default and say "spam is ill-defined and being set to 0. Use the zero_division parameter to control this behavior".
|
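Assuming a scikit-learn release where the zero_division parameter from this discussion is available (0.22+), the behaviour sketched above can be illustrated like this: the default warns and returns 0, while an explicit value silences the warning.

```python
import warnings

from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_score

y_true = [0, 0, 0]
y_pred = [0, 0, 0]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default = precision_score(y_true, y_pred)                    # warns, 0.0
    explicit = precision_score(y_true, y_pred, zero_division=1)  # silent, 1.0

undefined = [w for w in caught if issubclass(w.category, UndefinedMetricWarning)]
print(default, explicit, len(undefined))  # 0.0 1.0 1
```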
I updated the PR description so that we have all cases clear. In the development of this, I've found that the F-score can fail if lists are passed instead of np.arrays. Do you want me to fix this as part of this PR? @jnothman @qinhanmin2014 Example:
|
No, see #14865.
|
I think at one point (before 2013?) the behaviour was somewhat inconsistent, or there was debate about what the right behaviour should be. We settled on a warning strategy, and a conservative approach to awarding score: don't award perfect precision to a system just because it gave no positive labels for that class/sample. |
Uhh,... it seems I'd forgotten to send that reply, @qinhanmin2014 |
any updates on this? |
the PR #14900 was merged |
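For reference, after that change (released in scikit-learn 0.22) the behaviour discussed in this issue is controlled by the `zero_division` parameter; a quick sketch:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0]
y_pred = [0, 0, 0]  # TP = FP = FN = 0 for the positive class

# Conservative behaviour (also what the default "warn" returns, with a warning)
print(precision_recall_fscore_support(y_true, y_pred, average="binary",
                                      zero_division=0))  # (0.0, 0.0, 0.0, None)

# The behaviour requested in this issue
print(precision_recall_fscore_support(y_true, y_pred, average="binary",
                                      zero_division=1))  # (1.0, 1.0, 1.0, None)
```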
I've seen these already closed issues (and probably there are more):
#13453
#13143
about the behavior of prec/rec/F1 when both predictions and ground truth are all negatives. See this thread for some discussion of the topic:
https://stats.stackexchange.com/questions/8025/what-are-correct-values-for-precision-and-recall-when-the-denominators-equal-0
My opinion is that sklearn should be flexible about this. I agree with having a default behavior of returning 0 + warning, but I would like to have the option to say: For these special cases, if the true positives, false positives and false negatives are all 0, the precision, recall and F1-measure are 1.
I use this in my day-to-day work, and at the moment I have a wrapper around these scorers to force this. It's not exactly the same scenario, but to simplify, assume you're doing named entity recognition (with one entity to make it easy) and want to apply the F1-score to each document and then average them. If a document has no entities and your model predicts no entities, you would like a perfect score. Unfortunately, the current sklearn scores don't allow you to do that.
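A rough sketch of such a wrapper (my own code, assuming binary 0/1 labels; not the author's actual implementation):

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_all_negative_is_perfect(y_true, y_pred, **kwargs):
    """Hypothetical wrapper: perfect score when no positives exist or are predicted."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if not y_true.any() and not y_pred.any():
        return 1.0  # no entities expected, none predicted: reward the model
    return f1_score(y_true, y_pred, **kwargs)

# Per-document scores, averaged afterwards (the NER-style use case above)
docs = [([0, 0, 0], [0, 0, 0]),   # document with no entities, none predicted
        ([1, 0, 1], [1, 0, 0])]   # document with entities, partially found
scores = [f1_all_negative_is_perfect(t, p) for t, p in docs]
print(np.mean(scores))  # (1.0 + 0.666...) / 2 ≈ 0.83
```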
This is a proposal, and I can do the PR if people here find it useful. My idea is to add a parameter to these scorers so that the default is to raise a warning (and return 0), and the user can set it to new modes:
a) raise an error
b) warning + return 1
Please let me know your thoughts