precision_recall_curve - assumed limits can be misleading · Issue #4223 · scikit-learn/scikit-learn · GitHub

precision_recall_curve - assumed limits can be misleading #4223


Closed
trevorstephens opened this issue Feb 8, 2015 · 27 comments

@trevorstephens
Contributor

Referencing "The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This ensures that the graph starts on the x axis."

Say you have a tie in the highest predicted probabilities output by the classifier, with some false positives and some true positives among them, for example:

from sklearn.metrics import precision_recall_curve

y_true = [ 1,  0,  0,  1,  0,  1,  0,  0,  0,  1,  0]
y_pred = [.9, .9, .9, .8, .8, .7, .7, .6, .5, .4, .3]
precision, recall, thresholds = precision_recall_curve(y_true, y_pred)

Yields:

Precision: [0.400, 0.333, 0.375, 0.428, 0.400, 0.333, 1.000]
Recall:    [1.000, 0.750, 0.750, 0.750, 0.500, 0.250, 0.000]

[plot: precision-recall curve for the values above]

I understand that recall goes to zero in the limit, but precision might not be so clear-cut. A really difficult problem where your top prediction is a false positive would have precision go to zero in the limit, I believe. For plotting with this function, that case is probably not a big deal: a vertical line from 0 to 1 would be hidden by the y-axis.

But for tied top predictions, as seen above, the plot becomes misleading and makes the viewer think that at least their top prediction was a true positive. This may seem like a corner case, but I ran into it on a tough classification problem I was working on and was a little baffled by the output until I checked out the code.

A quick fix that preserves the intention of a clean P-R plot might be to draw a horizontal line from the highest threshold to the y-axis; this would not alter the output in most cases, like the one here. Trying to actually calculate where it's going in the limit might be a bit overkill :-)

I would also think that adding a 1 to the end of the thresholds vector would be helpful when plotting both precision and recall against the probabilities.
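For instance, a minimal sketch of that plotting workaround (reusing y_true and y_pred from above, and assuming probability scores so that 1.0 is a sensible final threshold):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_true, y_pred)
padded_thresholds = np.r_[thresholds, 1.0]  # pad so lengths match precision/recall
plt.plot(padded_thresholds, precision, label='precision')
plt.plot(padded_thresholds, recall, label='recall')
plt.xlabel('threshold')
plt.legend()
plt.show()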

I'm happy to open a PR if anyone thinks this is worth addressing.

@jnothman
Member
jnothman commented Feb 9, 2015

Can you find literature that formally discusses the case of threshold higher than top score (which need not be bounded by 1, as it's not necessarily a probability in this function)?

As I understand it, PR curves traditionally descend from P=1, R=0, although perhaps in http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html the PR-curve does not actually meet the Y axis (but I'm not sure I'm reading it correctly)...?

More important, perhaps, is to check that average precision, as the area under this curve, agrees with definitions elsewhere.

@trevorstephens
Contributor Author

Can you find literature that formally discusses the case of threshold higher than top score

Shall do a bit of a search over the coming days.

(which need not be bounded by 1, as it's not necessarily a probability in this function)?

And yes, I forgot about predict_log_proba etc. Perhaps, if this goes anywhere, a function argument like threshold={None, 'proba', 'log_proba', ...} would be helpful to some people, to automagically add a limiting threshold at the end for ease of plotting?

As I understand it, PR curves traditionally descend from P=1, R=0, although perhaps in http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html the PR-curve does not actually meet the Y axis (but I'm not sure I'm reading it correctly)...?

I'll check it out, looks like some averaging going on at a quick glance.

More important, perhaps, is to check that average precision, as the area under this curve, agrees with definitions elsewhere.

This is more involved than I thought :-) Will check it out.

@jnothman
Member

decision_function is more the issue than predict_log_proba; it allows this metric to be used with uncalibrated SVC, for instance.

There is averaging in the red plot in the IR book, but what's happening in the left of the blue plot is less clear.

Thanks for looking into it.


@trevorstephens
Contributor Author

@jnothman , I've found a couple of references that are somewhat relevant, interested in your thoughts so far:

The Relationship between Recall and Precision (around 297 citations) does not specifically address the question, but has several examples of P-R plots hitting the y-axis at values other than one. (The link may require manually saving the pdf out of the browser.)

The ROCR package in R does not appear to extrapolate beyond the thresholds relevant to the predictions.

This CrossValidated thread is the closest thing I have stumbled across, where quite a few different takes are presented.

Is it correct that, as true positives and false positives approach 0, the precision approaches 1? Same question for recall

Most interesting to me is Rob Hyndman's (somewhat brief) take:

If false positives and false negatives both approach zero at a faster rate than true positives, then yes to both questions. But otherwise, not necessarily.

He's a well-respected statistician in the energy forecasting space, but there's no paper to refer to :-/

@jnothman
Member

I've just recalled this. You seem chuffed to find Hyndman's opinion in agreement with you.

I'm still most curious to see what definitions of average precision say: for users of precision_recall_curve for visualisation it's fine to provide an option. For average_precision we need to have a more practically consistent answer.

@trevorstephens
Contributor Author

I've just recalled this.

I see what you did there... :-)

You seem chuffed to find Hyndman's opinion in agreement with you.

It was more the fact I knew his name that caught my eye. He's very active in the area in which I work.

I'm still most curious to see what definitions of average precision say: for users of precision_recall_curve for visualisation it's fine to provide an option. For average_precision we need to have a more practically consistent answer.

As in a function parameter like y-intercept=True/False or something? I shall try to look into average precision too. Hopefully there's some more discussion on the corner cases there.

@trevorstephens
Contributor Author

Sorry for the delay, but I suppose it's not really a pressing issue as this is certainly a corner case. Anyhow, I was unable to locate any specifics on unusual classifiers in the literature beyond what I've already posted; however, http://ibrarian.net/navon/paper/Recall__Precision_and_Average_Precision_Mu_Zhu.pdf?paperid=8275510 defines AP as:

AP = Σ_i p(i) Δr(i)

where Δr(i) is the change in the recall from i − 1 to i. ... It's not immediately clear from the example where the i's begin and end, i.e. is there an item 11? But if we take the (oh dear, I hope I never meet you, evil classifier) example:

y_true = [1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0]
y_pred = [.9,.9,.9,.9,.9,.9,.9,.9,.9,.9,.8,.8,.8,.8,.8,.7,.7,.7,.7,.7,.6,.6,.6,.6,.6]

Then I assume you would build this version of table 1 as:

Item (i)  Hit  p(i)  r    Δr(i)
.9        1    1/10  1/3  1/3
.8        1    2/15  2/3  1/3
.7        1    3/20  3/3  1/3
.6        0    3/25  3/3  ?

Which would be:

AP = (1/3)(1/10) + (1/3)(2/15) + (1/3)(3/20) = 23/180 ≈ 0.128

In the metrics though:

from sklearn.metrics import average_precision_score

average_precision_score(y_true, y_pred)
# 0.269444444444

Which really doesn't jibe with even simple intuition about the above (albeit quite crazy) example.
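(As a quick sanity check of the hand calculation, a sketch that just sums p(i) · Δr(i) from the table above:)

p = [1 / 10, 2 / 15, 3 / 20]
delta_r = [1 / 3, 1 / 3, 1 / 3]
print(sum(pi * dri for pi, dri in zip(p, delta_r)))  # ~0.128, well below 0.2694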

So I'd say @jnothman that your suspicion that average_precision_score is affected by this function as well is correct. Though I've yet to find any conclusive literature on the subject after a fair bit of searching. :-/

@jnothman
Member

Also reported in #5073. Let's get some resolution here. Ideally we'd like some confirmation from other implementations.

With or without confirmation, we could provide an option precision_at_0 (to precision_recall_curve and average_precision; maybe it should be precision_at_infinity, depending on whether we're talking recalls or thresholds). The current behaviour would be replicated by precision_at_0=1.0, but we could support precision_at_0=None meaning no point would intercept the axis unless there is a threshold for which recall=0, and precision_at_0='nextbest' or something that would just draw a horizontal line to the axis from the next point. Nomenclature suggestions welcome, as are alternative approaches.
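For concreteness, a rough sketch of what those options could mean (a hypothetical helper, not scikit-learn API; it assumes precision/recall arrays that do not yet include the artificial recall=0 point):

import numpy as np

def pad_pr_curve(precision, recall, precision_at_0='nextbest'):
    """Hypothetical helper: append a recall=0 endpoint using the chosen convention."""
    if precision_at_0 is None:
        # no artificial point; the curve only meets the axis if a real threshold does
        return precision, recall
    if precision_at_0 == 'nextbest':
        # horizontal line to the axis from the last real point
        p0 = precision[-1]
    else:
        # a fixed value; 1.0 reproduces the current behaviour
        p0 = float(precision_at_0)
    return np.r_[precision, p0], np.r_[recall, 0.0]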

We could change the default when we had a consensus of prior work.

@jnothman
Member

FWIW, Weka introduces a "fake" point at (0, 0) in ThresholdCurve.getCurve, (although the fact that precision is 0 in this context may only be an artefact of code reuse). However in getPRCArea (comparable to average_precision), it comments "start from the first real p/r pair (not the artificial zero point)". The same data is used for their ROC calculation, although there the "artificial zero point" is treated differently rather than ignored.

@lfiaschi

I came across the same problem while working on a difficult classification problem, and I would like to highlight that this situation is more problematic than a simple corner case. In fact, when using CV or other methods to optimize classifier metaparameters for average_precision, it may often be the case that some parameter combinations produce (bad) classifiers with sharp decision boundaries; these will, however, be favoured by the way the metric is currently computed. My current workaround is to reset the precision value at recall = 0 to 0, which seems to work well in combination with CV. I tend to favour @jnothman's proposal of providing an additional precision_at_0 parameter as a more general solution.
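A minimal sketch of that workaround, assuming y_true and scores are the labels and classifier scores at hand:

from sklearn.metrics import precision_recall_curve

precision, recall, _ = precision_recall_curve(y_true, scores)
# zero out the (artificial) precision value at recall = 0 before computing an area
precision[recall == 0] = 0.0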

@blucena
blucena commented Apr 28, 2016

With respect to precision_recall_curve() - since we are dealing with empirical data (not a theoretical distribution), there are 3 possibilities as to what the most extreme point of the curve should be (i.e. the point where recall is closest to 0)

  1. The unique point with the highest score is "in-class". Then recall = 1/(number of in-class cases) and precision = 1 for that threshold.
  2. The unique point with the highest score is "out-of-class". Then recall = 0 and precision = 0 for that threshold.
  3. There are multiple points tied for the highest value. Then recall = (# in-class with that value)/(# total in-class cases) and precision = (# in-class with that value)/(total cases with that value)

In my opinion, it never makes sense to say precision = 1 and recall = 0. If you did not identify a single in-class case (no in-class cases cleared the threshold), then how could you possibly claim to have any precision? I think it is a mistake to say that "precision at recall = 0" is always defined. It is either 0 (if the highest-ranked case is out-of-class) or recall never reaches 0 (because one or more in-class cases have the highest score in the set). It is not necessary to have a point for a threshold where no cases (in- or out-of-class) are identified as positive. On the other hand, it does make sense to have a point where all points are identified as positive, since that gives a baseline performance.
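To illustrate case 2 with a tiny made-up example (a sketch; the expected output is noted in the comments):

from sklearn.metrics import precision_recall_curve

y_true = [0, 1, 1]          # the single top-scored example is out-of-class
scores = [0.9, 0.6, 0.3]
precision, recall, thresholds = precision_recall_curve(y_true, scores)
print(precision)  # roughly [0.667 0.5 0. 1.] - ends with the artificial 1.0
print(recall)     # [1. 0.5 0. 0.] - even though the top prediction is a false positive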

@amueller
Member
amueller commented Oct 7, 2016

Is this fixed by #7356? Hm I guess not...

@amueller amueller added the Bug label Oct 11, 2016
@amueller amueller added this to the 0.19 milestone Oct 11, 2016
@ndingwall
Contributor
ndingwall commented Jun 7, 2017

@amueller I just saw this thread. Yes, this will be fixed in #7356 since the last (left-most) precision value isn't used: the first term in the sum is p_1 (r_1 - r_0), where r_0 = 0 and I'm indexing from the smallest recall values.
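For reference, a sketch of that uninterpolated sum (assuming precision and recall as returned by precision_recall_curve, i.e. with recall in decreasing order); the appended precision value at recall = 0 never enters the sum:

import numpy as np

def uninterpolated_ap(precision, recall):
    # sum of P_n * (R_n - R_{n-1}); the final (artificial) precision entry is dropped
    return -np.sum(np.diff(recall) * np.asarray(precision)[:-1])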

@GaelVaroquaux
Member
GaelVaroquaux commented Jun 7, 2017 via email

@jnothman
Member

@GaelVaroquaux I assume I can close. Reopen if I'm wrong.

@tpet
tpet commented Jul 20, 2017

The referenced issues and PRs seem to diverge into the computation of average precision, related interpolation methods, etc. Although that is of course related, always returning precision=1, no matter what the data is, is simply incorrect, as @blucena pointed out. I would suggest keeping this issue open until it is fixed. The precision_recall_curve function should not return precision=1 unless there is a decision boundary which separates positive-only examples from the rest.

@ndingwall
Contributor

Good point @tpet; average_precision_score was fixed because it's no longer affected by the introduction of the (0, 1) operating point in precision_recall_curve, but that point is still there.

@brechtmann
brechtmann commented Nov 10, 2020

Is this Issue still under development? I just ran into the same problem as @trevorstephens.

I came up with the following minimal example:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true = [False, True, True, False, False, False, True, False, False, False]
probas_pred = [1, 1, 1, 1, 0, 0, -1, -2, -3, -4]

precision, recall, thresholds = precision_recall_curve(y_true, probas_pred)
plt.plot(recall, precision)

[plot: scikit-learn precision-recall curve for this example]

Running the same in R using the PRROC package yields a different curve:

y_true = c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
probas_pred = c(1, 1, 1, 1, 0, 0, -1, -2, -3, -4)

library(PRROC)
pr<-pr.curve(scores.class0 = probas_pred, weights.class0 = y_true, curve = TRUE)
plot(pr)

[plot: PRROC precision-recall curve for the same data]

Further information on the R implementation can be found here:
https://cran.r-project.org/web/packages/PRROC/vignettes/PRROC.pdf

@ndingwall
Contributor
ndingwall commented Nov 11, 2020

The problem here is not so much with the first interpolating point (although I agree that it's confusing), but with using linear interpolation between operating points. There is no threshold that results in (recall=0.8, precision ~0.4) in the example above, and so linearly interpolating isn't valid. Consider the even more extreme case of a do-nothing predictor:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true = [False, True, True, False]
probas_pred = [0.5, 0.5, 0.5, 0.5]
precision, recall, thresholds = precision_recall_curve(y_true, probas_pred)
plt.plot(recall, precision);

[plot: straight line drawn between the two operating points]

Your choices are to pick a threshold below 0.5 (recall=0, precision=undefined), or above 0.5 (recall=1, precision=0.5). The straight line between them is misleading because it suggests points along that line are possible operating points (perhaps in the limit of infinite data).

The correct interpretation is to plot a step-function:

plt.step(recall, precision, where='post');

[plot: the same curve drawn as a step function]

This gives the correct intuition: if you want a recall of e.g. 0.2, you must choose a threshold above 0.5, which means precision = 0.5. For any recall > 0, the precision is 0.5; for recall = 0, precision is undefined (or whatever you decide the precision of an empty set is - it's totally arbitrary and doesn't affect your classifier because who wants recall=0?)
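(As a quick check of the summary metric for the same do-nothing predictor, a sketch:)

from sklearn.metrics import average_precision_score

y_true = [False, True, True, False]
probas_pred = [0.5, 0.5, 0.5, 0.5]
print(average_precision_score(y_true, probas_pred))  # 0.5, matching the step-function view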

I wrote a whole blog post about this a few years ago if you need to be convinced any more!

@ndingwall
Contributor

Since this is a common error, perhaps we could add a note to the precision_recall_curve docstring instructing people to plot it using plt.step(recall, precision, where='post'), and then close this issue. @amueller what do you think?

@cmarmo cmarmo removed this from the 0.19 milestone Nov 11, 2020
@MuhammedHasan
MuhammedHasan commented Dec 9, 2020

The problem is not only plotting, @ndingwall. The AUC value also depends on the precision-recall values.

Predictions and targets can be ranked, and the ranks can then be used as scores, which breaks the ties. However, the order of the targets matters, so they should be sorted as well.

import functools
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc

def tie_breaker(func):
    @functools.wraps(func)
    def tie_less_metric(y_true, probas_pred, pos_label=None,
                        sample_weight=None, **kwargs):
        if pos_label is not None:
            # binarize so ties can be broken against the positive class directly
            y_true = [i == pos_label for i in y_true]

        # sort by prediction (descending); within ties, negatives come first
        y_true, probas_pred = zip(*sorted(
            zip(y_true, probas_pred),
            key=lambda x: (x[1], -x[0]), reverse=True))
        # convert probabilities into ranks to break the ties
        ranks = [
            (len(probas_pred) - i, prob)
            for i, prob in enumerate(probas_pred)
        ]
        rank = [i[0] for i in ranks]
        precision, recall, thresholds = func(y_true, rank,
                                             sample_weight=sample_weight, **kwargs)
        # convert the rank thresholds back into the original scores
        ranks = dict(ranks)
        thresholds = [ranks[i] for i in thresholds]
        return precision, recall, thresholds

    return tie_less_metric

With tie:

import matplotlib.pyplot as plt

y_true = [False, False, True, True]
probas_pred = [0.5, 0.5, 0.5, 0.5]
precision, recall, thresholds = precision_recall_curve(y_true, probas_pred)
plt.plot(recall, precision)
print(auc(recall, precision), precision, recall, thresholds)
# 0.75 [0.5 1. ] [1. 0.] [0.5]

[plot: precision-recall curve with the tie]

Without tie:

y_true = [False, True, True, False]
probas_pred = [0.5, 0.5, 0.5, 0.5]
tieless_precision_recall_curve = tie_breaker(precision_recall_curve)
precision, recall, thresholds = tieless_precision_recall_curve(y_true, probas_pred)
plt.plot(recall, precision)
print(auc(recall, precision), recall, precision, thresholds)
# 0.29166666666666663 [1.  0.5 0.  0.  0. ] [0.5        0.33333333 0.         0.         1.        ] [0.5, 0.5, 0.5, 0.5]

[plot: precision-recall curve after tie-breaking]

@ndingwall
Contributor

I'm not sure I follow this. The AUC is affected by your choice of interpolation. Here you've used linear interpolation which isn't valid for precision-recall curves: see https://www.biostat.wisc.edu/~page/rocpr.pdf for an explanation.

Predictions and targets can be ranked then rank can be used as a score so it will break the tie.

When you break ties, you're changing the classifier, so you should expect metrics (like average_precision_score) to change.

@Hoeze
Hoeze commented Dec 15, 2020

@ndingwall If you say the step function with where="post" gives the correct curve, shouldn't the AUC then be calculated from the step function as well?

EDIT:
As a demonstration, auc reports 0.2% higher for the step function:

import numpy as np

def prc_step(precision, recall, sklearn_mode=True):
    if not sklearn_mode:
        # make sure that input is sorted by recall
        idx = np.argsort(recall)
    else:
        # by default, sklearn reports recall sorted descending
        # => just inverting in this case is faster
        idx = slice(None, None, -1)

    prec_step = np.zeros((len(precision) * 2) - 1)
    rec_step = np.zeros((len(recall) * 2) - 1)

    prec_step[np.arange(len(precision)) * 2] = precision[idx]
    rec_step[np.arange(len(recall)) * 2] = recall[idx]

    # resemble 'post' step plot: extend each precision value back to the previous recall
    prec_step[np.arange(len(recall) - 1) * 2 + 1] = precision[idx][1:]
    rec_step[np.arange(len(recall) - 1) * 2 + 1] = recall[idx][:-1]

    return prec_step, rec_step
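A usage sketch (y_true and y_pred stand in for whatever labels and scores were used above):

from sklearn.metrics import average_precision_score, auc, precision_recall_curve

precision, recall, _ = precision_recall_curve(y_true, y_pred)
prec_step, rec_step = prc_step(precision, recall)
print(auc(rec_step, prec_step))                 # area under the step curve
print(average_precision_score(y_true, y_pred))  # should be very close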

[plot: the resulting step-interpolated precision-recall curve]

EDIT2:
This is actually very close to average precision score:

metrics.average_precision_score(y_true, y_pred)

0.03238006997915479

@ndingwall
Contributor

shouldn't the AUC then be calculated from the step function as well?

Yes, this is exactly what average_precision_score does since 0.19 (as you noticed). I wrote a whole blog post on this a few years ago if you're interested.

@amueller
Member

Is there still any issue with the current state? The plot_precision_recall_curve function actually does the step interpolation, which is the correct thing to do, as @ndingwall states. average_precision now uses the formula from Wikipedia, so the point doesn't influence it any more.
For ROC AUC, linear interpolation is valid, and both the plotting and roc_auc use linear interpolation.
[This whole thing has been a pet peeve of mine for years. Thanks for the blog post @ndingwall, though I guess it's out of date now since I think we fixed all of the issues?]

Maybe we could make the issue "explain what's up with interpolation", though there's now some references in the docs:
https://scikit-learn.org/dev/modules/model_evaluation.html#precision-recall-and-f-measures

@ndingwall
Contributor

@amueller, yes, the blog-post should be updated, although I doubt it gets much traffic!

Is there still any issue with the current state?

I think precision/recall is a deceptively confusing topic, and there are a few things that compound common misconceptions:

  • the name precision_recall_curve isn't ideal, but I can't think of a better alternative and changing it would no doubt break a lot of code.
  • the auc documentation suggests that you "see also" precision_recall_curve, even though the existing auc function can't be used to compute anything meaningful with respect to precision-recall. It also says "For an alternative way to summarize a precision-recall curve, see average_precision_score" which suggests that auc itself is a valid way to summarize a precision-recall curve.
  • the average_precision_score documentation isn't explicit enough that auc is not valid for precision-recall curves.
  • the Precision Recall multiclass example uses plt.plot instead of plt.step (the micro-averaged version uses plt.step correctly).

I can work on a PR that either (A) minimally corrects the documentation for auc (explaining that it cannot be used with precision_recall_curve and pointing to average_precision_score as an alternative), or (B) adds an interpolation argument to auc that can be either "linear" (default) or "step", documents that precision_recall_curve uses the latter, and has average_precision_score use auc with the "step" argument.

I think (B) risks getting unwieldy since we'd probably end up wanting to offer "from-left" and "from-right" as options for "step" (for completeness), and before you know it the API is a confusing mess... On the other hand, it would emphasize that linear is just one of several ways to compute the area under a curve described by its coordinates.

That leaves the Precision Recall multiclass documentation. Since the precision_recall_curve docstring says "Note: this implementation is restricted to the binary classification task" and the multiclass example uses the trick of doing 1-vs-rest binary classification for each label, I'd advocate for dropping the multiclass section entirely. We could link to the OneVsRestClassifier if you really want multiclass covered.

@adrinjalali
Member

Seems the original issue is actually solved / improved a lot since this issue was opened. Happy to have a fresh issue with concrete points on what can be improved in the docs or the API.
