Precision Recall and F-score: behavior when all negative #14876


Closed
marctorsoc opened this issue Sep 3, 2019 · 22 comments

@marctorsoc
Contributor

I've seen these already-closed issues (and there are probably more):

#13453
#13143

about the behavior of prec/rec/f1 when both the predictions and the ground truth are all negatives. See the following for some discussion of the topic:

https://stats.stackexchange.com/questions/8025/what-are-correct-values-for-precision-and-recall-when-the-denominators-equal-0

My opinion is that sklearn should be flexible about this. I agree with having a default behavior of returning 0 + warning, but I would like to have the option to say: For these special cases, if the true positives, false positives and false negatives are all 0, the precision, recall and F1-measure are 1.

I use this in my day-to-day work, and at the moment I have a wrapper around these scorers to force this behavior. It's not exactly the same scenario, but to simplify, assume you're doing named entity recognition (with a single entity type to keep it easy) and want to compute an F1-score per document and then average them. If a document has no entities and your model predicts no entities, you would like a perfect score for that document. Unfortunately, the current sklearn scorers don't allow you to do that; a sketch of the kind of wrapper I use is below.
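A minimal sketch of such a wrapper, assuming binary 0/1 labels per document; the helper name and the toy data are mine, not part of scikit-learn:

import numpy as np
from sklearn.metrics import f1_score

def f1_or_one_if_all_negative(y_true, y_pred):
    """F1 that awards 1.0 when both truth and prediction are all negative."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    if not y_true.any() and not y_pred.any():
        return 1.0
    return f1_score(y_true, y_pred)

# Per-document scores, then an unweighted average over documents.
docs_true = [[0, 0, 0], [0, 1, 0]]
docs_pred = [[0, 0, 0], [0, 1, 1]]
scores = [f1_or_one_if_all_negative(t, p) for t, p in zip(docs_true, docs_pred)]
print(np.mean(scores))  # the first document scores 1.0 instead of 0.0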

This is a proposal, and I can do the PR if people here find it useful. My idea is to add a parameter to these scorers so that the default is to raise a warning (and return 0), and the user can set it to new modes:

a) raise an error
b) warning + return 1

Please let me know your thoughts

@jnothman
Member
jnothman commented Sep 4, 2019

(Having some expertise in NER evaluation, I find the thought of using Scikit-learn's PRF implementation for NER a bit awkward. I'm also not entirely convinced that an average of scores per document is going to be very informative if some documents have no entities... Rather, you might want the average recall over only those documents with entities, and the average precision over only those documents that have predictions...)

Yes, in principle I could consider supporting returning 1... not that I did back when we settled on the current behaviour in 2013 or so. I'm not convinced we need a "raise an error" option (all warnings can be converted to errors using the warnings module), nor that we necessarily need a warning as well as returning 1.

Please specify:

  • what would the parameter be called
  • what would you return if y_true is all negative, but positives are in y_pred
  • what would you return if y_pred is all negative, but positives are in y_true
  • what would you return if y_true and y_pred are all negative

@qinhanmin2014
Member

Copy-paste my comment from #12312
(1) How to define precision when TP + FP = 0? In precision_score, we raise a warning and return 0. In average_precision_score, we return 0 (in my PR #9980). But in an SO question provided by the contributor of #8280, it's defined as 1.
(2) How to define recall when TP + FN = 0? In recall_score, we raise a warning and return 0. In average_precision_score, we raise a warning and return nan (the nan issue here). In #8280, the contributor returns 1 in average_precision_score (consistent with the SO question he provided), which is inconsistent with our recall_score.

@qinhanmin2014
Member

I agree that 1 might be a more reasonable choice, but we use 0 everywhere and I'm not sure whether it's worthwhile to add a parameter.

@jnothman
Member
jnothman commented Sep 4, 2019 via email

@marctorsoc
Contributor Author

About the NER choice, thanks for your comment. I know it's not standard usage, but my intuition goes like this: I want to reward the system when it predicts all negatives and there aren't any entities, but I don't want to use accuracy, because for the documents with entities I don't want to boost that document's score based on detecting a lot of TNs. Does it make more sense written like this? Plus, in my scenario most documents have at most one entity, and I don't want the length of an entity to give one document more weight than another.

Let's do the F-score case, but I can replicate it for prec/rec if it's not clear. Consider it as TP / (TP + 0.5*(FP + FN)).
Given that we discarded the raise-an-error option (which I'm not interested in either), we can simplify this to a bool.

what would the parameter be called
The name is always very debatable; I can think of:

  • return_one_if / return_zero_if / penalize_if
  • all_negatives / no_positives

Any of the six combinations LGTM, but if I have to choose one I would say return_one_if_no_positives, with default=False, so that it behaves the same as now.

what would you return if y_true is all negative, but positives are in y_pred
In such a case I just go back to the expression above: TP=0, FP>0, FN=anything --> return 0.

what would you return if y_pred is all negative, but positives are in y_true
Again, TP=0, FP=anything, FN>0 --> return 0

what would you return if y_true and y_pred are all negative
Using the formula this is 0/0; it currently returns 0 + warning, and the aim of this issue/PR is to return 1 + warning if the parameter is set accordingly.
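A quick numeric check of the three cases above, using the form F1 = TP / (TP + 0.5*(FP + FN)); the helper and the flag name are purely illustrative:

# Illustration only: F1 computed directly from counts, with the 0/0 case
# resolved by an explicit flag (mirroring the proposal above).
def f1_from_counts(tp, fp, fn, one_if_undefined=False):
    denom = tp + 0.5 * (fp + fn)
    if denom == 0:  # TP = FP = FN = 0: the contested case
        return 1.0 if one_if_undefined else 0.0
    return tp / denom

print(f1_from_counts(0, 3, 0))                         # y_true all negative, y_pred has positives -> 0.0
print(f1_from_counts(0, 0, 2))                         # y_pred all negative, y_true has positives -> 0.0
print(f1_from_counts(0, 0, 0, one_if_undefined=True))  # both all negative -> 1.0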

@jnothman
Member
jnothman commented Sep 4, 2019 via email

@marctorsoc
Contributor Author

Sure

Precision = TP / (TP + FP):

  • if y_true and y_pred are all negative: return 1.0 + warning if zero_division=1, otherwise 0.0 + warning
  • if y_true is all negative and y_pred has some positives: return 0.0 as usual, since TP=0 and FP>0; no warning
  • if y_pred is all negative and y_true has some positives: return 0.0 + warning (no matter the zero_division value). This is my choice, but I'm open to changing it. It means F1 is no longer 2*prec*rec / (prec + rec), but the values returned make more sense to me: the model is doing something wrong here, so it should be a 0 IMO

Recall = TP / (TP + FN):

  • if y_true and y_pred are all negative: return 1.0 + warning if zero_division=1, otherwise 0.0 + warning
  • if y_true is all negative and y_pred has some positives: return 0.0 + warning (no matter the zero_division value). This is my choice, but I'm open to changing it. It means F1 is no longer 2*prec*rec / (prec + rec), but the values returned make more sense to me: the model is doing something wrong here, so it should be a 0 IMO
  • if y_pred is all negative and y_true has some positives: return 0.0 as usual, since TP=0 and FN>0; no warning

About the naming: for me it's less self-explanatory, but it's shorter and the value of the param is clearer, so it's fine with me.

@jnothman
Member
jnothman commented Sep 4, 2019 via email

@marctorsoc
Contributor Author

Yeah, that's ok. So I think the scope is defined, unless there's something else. This would change only precision_score, recall_score, and f1_score, and maybe fbeta_score as well? I guess average_precision_score calls precision_score, so it will be using the updated ones, right?

shall I start to work on this and open a PR when I have it?

@jnothman
Member
jnothman commented Sep 4, 2019 via email

@qinhanmin2014
Member

I guess average_precision_score calls precision_score, so it will be using the updated ones, right?

No, see:

def precision_recall_curve(y_true, probas_pred, pos_label=None,

def _binary_clf_curve(y_true, y_score, pos_label=None, sample_weight=None):

@qinhanmin2014
Member

We avoided it in part because it's
generally safer to report a lower bound on a metric than an upper bound :)

could you please provide more details? @jnothman

@qinhanmin2014
Member

Precision = TP / (TP + FP):
if y_true and y_pred are all negative: return 1.0 + warning if zero_division=1, otherwise 0.0 + warning
if y_true is all negative and y_pred has some positives: return 0.0 as usual, since TP=0 and FP>0; no warning
if y_pred is all negative and y_true has some positives: return 0.0 + warning (no matter the zero_division value). This is my choice, but I'm open to changing it. It means F1 is no longer 2*prec*rec / (prec + rec), but the values returned make more sense to me: the model is doing something wrong here, so it should be a 0 IMO

@marctorrellas I'm unable to understand your solution.
In the first situation and the third situation, TP+FP=0, but the solution is different. When TP+FP=0, why do we need to look at y_true?

@marctorsoc
Contributor Author

I agreed to go with a 1 for the "open to change" cases :)

@marctorsoc
Contributor Author
marctorsoc commented Sep 6, 2019

I started to do this and found a small problem for which I see two options, so I prefer to ask. The method _prf_divide is used in precision_recall_fscore_support to obtain both precision and recall. It is easy to set them to zero_division when there is a zero division :D

The problem comes with the warning message, which goes along the lines of:

"Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples"

The problem is that when computing precision, I don't know what value I'm going to set for F-score, because it depends on recall. Therefore, I see two options (but enlighten me if you find a better one):

  • Change the message to something like:
     Precision is being set to 0.0, and F-score will be set according to the
     `zero_division` parameter (default=0) if both precision and recall are
     undefined, otherwise to 0.
  • Return something extra when computing precision, to be used when computing recall, so that we can clearly tell what F-score is gonna be set to. This looks ugly to do if there are multiple classes

I prefer option 1, but I understand some people won't like having an ambiguous message. What do you reckon?
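For concreteness, a rough sketch (not the actual scikit-learn source; the name and signature are illustrative) of how a _prf_divide-style helper could honour a zero_division argument:

import warnings
import numpy as np

def prf_divide(numerator, denominator, zero_division="warn", msg=None):
    numerator = np.asarray(numerator, dtype=float)
    denominator = np.asarray(denominator, dtype=float).copy()
    mask = denominator == 0.0
    denominator[mask] = 1.0  # avoid 0/0 producing nan
    result = numerator / denominator
    if not mask.any():
        return result
    # Ill-defined entries: 0.0 for "warn" (the current default) or 0, 1.0 for 1.
    result[mask] = 1.0 if zero_division == 1 else 0.0
    if zero_division == "warn" and msg:
        warnings.warn(msg)
    return result

prf_divide([0, 2], [0, 4], zero_division=1)  # array([1. , 0.5])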

@jnothman
Member
jnothman commented Sep 6, 2019 via email

@marctorsoc
Contributor Author
marctorsoc commented Sep 7, 2019

I updated the PR description so that all the cases are clear. While developing this, I've found that the F-score can fail if lists are passed instead of np.arrays. Do you want me to fix this as part of this PR? @jnothman @qinhanmin2014

Example:

>>> f1_score(np.array([[1, 1], [1, 1]]), np.array([[0, 0], [0, 0]]), average="micro")
0.0
>>> f1_score([[1, 1], [1, 1]], [[0, 0], [0, 0]], average="micro")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/marctorrellassocastro/miniconda3/envs/deep_eigen/lib/python3.7/site-packages/sklearn/metrics/classification.py", line 1059, in f1_score
    sample_weight=sample_weight)
  File "/Users/marctorrellassocastro/miniconda3/envs/deep_eigen/lib/python3.7/site-packages/sklearn/metrics/classification.py", line 1182, in fbeta_score
    sample_weight=sample_weight)
  File "/Users/marctorrellassocastro/miniconda3/envs/deep_eigen/lib/python3.7/site-packages/sklearn/metrics/classification.py", line 1415, in precision_recall_fscore_support
    pos_label)
  File "/Users/marctorrellassocastro/miniconda3/envs/deep_eigen/lib/python3.7/site-packages/sklearn/metrics/classification.py", line 1239, in _check_set_wise_labels
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "/Users/marctorrellassocastro/miniconda3/envs/deep_eigen/lib/python3.7/site-packages/sklearn/metrics/classification.py", line 88, in _check_targets
    raise ValueError("{0} is not supported".format(y_type))
ValueError: multiclass-multioutput is not supported

@jnothman
Member
jnothman commented Sep 8, 2019 via email

@jnothman
Member
jnothman commented Sep 9, 2019

We avoided it in part because it's generally safer to report a lower bound on a metric than an upper bound :)

I think at one point (before 2013?) the behaviour was somewhat inconsistent, or there was debate about what the right behaviour should be. We settled on a warning strategy, and a conservative approach to awarding score: don't award perfect precision to a system just because it gave no positive labels for that class/sample.

@jnothman
Member
jnothman commented Sep 9, 2019

Uhh... it seems I'd forgotten to send that reply, @qinhanmin2014

@amrsharaf

any updates on this?

@marctorsoc
Contributor Author

any updates on this?

PR #14900 was merged.
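For anyone landing here later, a minimal usage sketch of the resulting behaviour, assuming a scikit-learn release that includes the zero_division parameter from that PR:

from sklearn.metrics import f1_score, precision_score

y_true = [0, 0, 0, 0]  # no positives in the ground truth
y_pred = [0, 0, 0, 0]  # no positives predicted

precision_score(y_true, y_pred)                   # warns and returns 0.0 (default zero_division="warn")
precision_score(y_true, y_pred, zero_division=1)  # returns 1.0, no warning
f1_score(y_true, y_pred, zero_division=1)         # returns 1.0, no warning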
