precision_recall_curve - assumed limits can be misleading #4223
Can you find literature that formally discusses the case of a threshold higher than the top score (which need not be bounded by 1, as it's not necessarily a probability in this function)? As I understand it, PR curves traditionally descend from P=1, R=0, although perhaps in http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html the PR curve does not actually meet the Y axis (but I'm not sure I'm reading it correctly)? More important, perhaps, is to check that average precision, as the area under this curve, agrees with definitions elsewhere.
Shall do a bit of a search over the coming days.
And yes, I forgot about predict_log_proba etc. Perhaps if this goes anywhere, a function arg like ...
I'll check it out, looks like some averaging going on at a quick glance.
This is more involved than I thought :-) Will check it out.
decision_function is more the issue than predict_log_proba; it allows this. There is averaging in the red plot in the IR book, but what's happening in ...? Thanks for looking into it.
@jnothman, I've found a couple of references that are somewhat relevant, interested in your thoughts so far: "The Relationship between Recall and Precision" (297 citations) does not specifically address the question, but has several examples of P-R plots hitting the y-axis at values other than one. (This link may require manual saving as a PDF out of the browser.) This CrossValidated thread is the closest thing I have stumbled across, where quite a few different takes are presented.
Most interesting to me is Rob Hyndman's (somewhat brief) take:
He's a well-respected statistician in the energy forecasting space, but no paper to refer to :-/
I've just recalled this. You seem chuffed to find Hyndman's opinion in agreement with you. I'm still most curious to see what definitions of average precision say: for users of ...
I see what you did there... :-)
It was more the fact I knew his name that caught my eye. He's very active in the area in which I work.
As in a function parameter like ...
Sorry for the delay, but I suppose it's not really a pressing issue as this is certainly a corner case. Anyhow, I was unable to locate any specifics on unusual classifiers in the literature beyond what I've already posted; however, http://ibrarian.net/navon/paper/Recall__Precision_and_Average_Precision_Mu_Zhu.pdf?paperid=8275510 defines AP as:

AP = Σ_k P(k) Δr(k)

where P(k) is the precision at cut-off k in the ranked list and Δr(k) is the change in recall from item k-1 to k.
Then I assume you would build this version of table 1 as:
Which would be: ...

In the ...
Which really doesn't jibe with even simple intuition about the above (albeit quite crazy) example. So I'd say, @jnothman, that your suspicion that ...
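For reference, that definition can be computed directly from a ranked list; here is a rough sketch (the function and variable names are mine, ties ignored):

```python
import numpy as np


def average_precision(y_true, y_score):
    """AP = sum over cut-offs k of P(k) * delta_r(k)."""
    order = np.argsort(y_score)[::-1]      # rank by decreasing score (ties broken arbitrarily)
    y_true = np.asarray(y_true)[order]
    tp = np.cumsum(y_true)                 # true positives retrieved at each cut-off k
    k = np.arange(1, len(y_true) + 1)
    precision_at_k = tp / k
    recall_at_k = tp / tp[-1]
    delta_r = np.diff(np.concatenate(([0.0], recall_at_k)))
    return float(np.sum(precision_at_k * delta_r))
```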
Also reported in #5073. Let's get some resolution here. Ideally we'd like some confirmation from other implementations. With or without confirmation, we could provide an option ... We could change the default once we had a consensus of prior work.
FWIW, Weka introduces a "fake" point at (0, 0) in ...
I ran into the same problem while working on a difficult classification problem, and I would like to highlight that this situation is more problematic than a simple corner case. In fact, when using CV or other methods to optimize classifier metaparameters for ...
With respect to precision_recall_curve() - since we are dealing with empirical data (not a theoretical distribution), there are 3 possibilities as to what the most extreme point of the curve should be (i.e. the point where recall is closest to 0)
In my opinion, it never makes sense to say precision = 1 and recall = 0. If you did not identify a single in-class case (no in-class cases cleared the threshold), then how could you possibly claim to have any precision? I think it is a mistake to say that "precision at recall = 0" is always defined. It is either 0 (if the highest-ranked case is out-of-class) or recall never reaches 0 (because one or more in-class cases have the highest score in the set). It is not necessary to have a point for a threshold where no cases (in- or out-of-class) are identified as positive. On the other hand, it does make sense to have a point where all cases are identified as positive, since that gives a baseline performance.
Is this fixed by #7356? Hm, I guess not...
@GaelVaroquaux I assume I can close. Reopen if I'm wrong.
The referenced issues and PRs seem to diverge into the computation of average precision, related interpolation methods, etc. Although that is of course related, always returning precision=1, no matter what the data is, is simply incorrect, as @blucena pointed out. I would suggest keeping this issue open until it is fixed. The precision_recall_curve function should not return precision=1 unless there is a decision boundary which separates positive-only examples from the rest.
Good point @tpet; ...
Is this Issue still under development? I just ran into the same problem as @trevorstephens. I came up with the following minimal example:
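A small reproduction of the kind of case in question (the arrays here are only illustrative, not the original snippet):

```python
from sklearn.metrics import precision_recall_curve

# the single highest-scoring prediction is a false positive
y_true = [1, 1, 0]
y_score = [0.2, 0.3, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(precision)   # approximately [0.66666667 0.5 0. 1.]
print(recall)      # approximately [1.  0.5 0.  0. ]
print(thresholds)  # approximately [0.2 0.3 0.9]
# note: the final precision value of 1. is the appended endpoint,
# even though precision is 0 at the highest real threshold
```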
Running the same in R using the PRROC package yields a different curve:
Further information on the R implementation can be found here: |
The problem here is not so much with the first interpolating point (although I agree that it's confusing), but with using linear interpolation between operating points. There is no threshold that results in (recall=0.8, precision ~0.4) in the example above, and so linearly interpolating isn't valid. Consider the even more extreme case of a do-nothing predictor:
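For instance, something along these lines, where every example gets the same score and half of the labels are positive (these arrays are just a stand-in for the example):

```python
y_true = [0, 0, 1, 1]           # half positive, half negative
y_score = [0.5, 0.5, 0.5, 0.5]  # the classifier assigns the same score to everything
```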
Your choices are to pick a threshold above 0.5 (recall=0, precision undefined), or at or below 0.5 (recall=1, precision=0.5). The straight line between them is misleading because it suggests points along that line are possible operating points (perhaps in the limit of infinite data). The correct interpretation is to plot a step function:
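A rough sketch of that step plot for the do-nothing predictor above, with the two attainable operating points written out by hand:

```python
import matplotlib.pyplot as plt

# the only attainable operating points for the do-nothing predictor
recall = [1.0, 0.0]     # everything predicted positive, then nothing predicted positive
precision = [0.5, 1.0]  # the value at recall=0 is the conventional (arbitrary) endpoint

plt.step(recall, precision, where='post')  # hold precision flat between operating points
plt.xlabel('recall')
plt.ylabel('precision')
plt.ylim(0, 1.05)
plt.show()
```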
This gives the correct intuition: if you want a recall of e.g. 0.2, you must choose a threshold at or below 0.5, which means precision = 0.5. For any recall > 0, the precision is 0.5; for recall = 0, precision is undefined (or whatever you decide the precision of an empty set is; it's totally arbitrary and doesn't affect your classifier, because who wants recall = 0?). I wrote a whole blog post about this a few years ago if you need any more convincing!
Since this is a common error, perhaps we could add an instruction in the ...
The problem is not only plotting, @ndingwall. The AUC value also depends on the precision-recall values. Predictions and targets can be ranked, and the rank can then be used as the score, which breaks the ties. However, the order of the targets matters, so they should be sorted as well.

```python
import functools

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc


def tie_breaker(func):
    @functools.wraps(func)
    def tie_less_metric(y_true, probas_pred, pos_label=None,
                        sample_weight=None, **kwargs):
        if pos_label is not None:
            # binarize y_true; the wrapped metric then uses its default pos_label
            y_true = [i == pos_label for i in y_true]
            pos_label = None
        # Sort predictions and true labels together
        # (score descending, negatives before positives within a tie)
        y_true, probas_pred = zip(*sorted(
            zip(y_true, probas_pred),
            key=lambda x: (x[1], -x[0]), reverse=True))
        # Convert probabilities into ranks to break the ties
        ranks = [
            (len(probas_pred) - i, prob)
            for i, prob in enumerate(probas_pred)
        ]
        rank = [i[0] for i in ranks]
        precision, recall, thresholds = func(y_true, rank,
                                             pos_label, sample_weight, **kwargs)
        # Convert the rank thresholds back into the original scores
        ranks = dict(ranks)
        thresholds = [ranks[i] for i in thresholds]
        return precision, recall, thresholds
    return tie_less_metric
```

With ties, using the plain function:

```python
import matplotlib.pyplot as plt

y_true = [False, False, True, True]
probas_pred = [0.5, 0.5, 0.5, 0.5]
precision, recall, thresholds = precision_recall_curve(y_true, probas_pred)
plt.plot(recall, precision)
print(auc(recall, precision), precision, recall, thresholds)
# 0.75 [0.5 1. ] [1. 0.] [0.5]
```

With the ties broken:

```python
y_true = [False, True, True, False]
probas_pred = [0.5, 0.5, 0.5, 0.5]
tieless_precision_recall_curve = tie_breaker(precision_recall_curve)
precision, recall, thresholds = tieless_precision_recall_curve(y_true, probas_pred)
plt.plot(recall, precision)
print(auc(recall, precision), recall, precision, thresholds)
# 0.29166666666666663 [1. 0.5 0. 0. 0. ] [0.5 0.33333333 0. 0. 1. ] [0.5, 0.5, 0.5, 0.5]
```
I'm not sure I follow this. The AUC is affected by your choice of interpolation. Here you've used linear interpolation, which isn't valid for precision-recall curves: see https://www.biostat.wisc.edu/~page/rocpr.pdf for an explanation.

When you break ties, you're changing the classifier, so you should expect metrics (like ...
@ndingwall If you say the step function with ...

EDIT:

```python
import numpy as np


def prc_step(precision, recall, sklearn_mode=True):
    """Expand (precision, recall) points into a 'post'-style step curve."""
    if not sklearn_mode:
        # make sure that the input is sorted by recall
        idx = np.argsort(recall)
    else:
        # by default, sklearn reports recall sorted descending,
        # so simply inverting is faster in this case
        idx = slice(None, None, -1)
    prec_step = np.zeros((len(precision) * 2) - 1)
    rec_step = np.zeros((len(recall) * 2) - 1)
    prec_step[np.arange(len(precision)) * 2] = precision[idx]
    rec_step[np.arange(len(recall)) * 2] = recall[idx]
    # insert the intermediate corner points so the result resembles a 'post' step plot
    prec_step[np.arange(len(recall) - 1) * 2 + 1] = precision[idx][1:]
    rec_step[np.arange(len(recall) - 1) * 2 + 1] = recall[idx][:-1]
    return prec_step, rec_step
```

EDIT2: metrics.average_precision_score(y_true, y_pred) ...
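For reference, one possible way to use the helper above together with scikit-learn's output (just a sketch; the y_true/y_score arrays below are placeholder data, not from this thread):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import precision_recall_curve

# placeholder labels and scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, y_score)
prec_step, rec_step = prc_step(precision, recall)

# traces the same shape as plt.step(recall, precision, where='post')
plt.plot(rec_step, prec_step)
plt.xlabel('recall')
plt.ylabel('precision')
plt.show()
```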
Yes, this is exactly what ...
Is there still any issue with the current state? Maybe we could make the issue "explain what's up with interpolation", though there are now some references in the docs:
@amueller, yes, the blog-post should be updated, although I doubt it gets much traffic!
I think precision/recall is a deceptively confusing topic, and there are a few things that compound common misconceptions:
I can work on a PR that either (A) minimally corrects the documentation for ..., or (B) ... I think (B) risks getting unwieldy, since we'd probably end up wanting to offer "from-left" and "from-right" as options for ...

That leaves the Precision-Recall multiclass documentation. Since the ...
Seems the original issue has actually been solved / improved a lot since it was opened. Happy to have a fresh issue with concrete points on what can be improved in the docs or the API.
Referencing the precision_recall_curve documentation:

> The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This ensures that the graph starts on the x axis.
Say you have a tie in your highest predicted probability output from the classifier with some false positives and some true positives, for example:
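Something along these lines (the particular arrays here are only an illustration of the situation described):

```python
from sklearn.metrics import precision_recall_curve

# two predictions tied at the top score: one true positive and one false positive
y_true = [0, 1, 0, 1]
y_score = [0.9, 0.9, 0.4, 0.3]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
```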
Yields:
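With a recent scikit-learn, approximately the following (the final precision/recall pair is the appended one without a threshold):

```python
print(precision)   # approximately [0.5  0.33333333  0.5  1. ]
print(recall)      # approximately [1.   0.5  0.5  0. ]
print(thresholds)  # approximately [0.3  0.4  0.9]
# the curve still ends at (recall=0, precision=1), even though the
# top-scoring tie contains a false positive (precision 0.5 at threshold 0.9)
```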
I understand that recall goes to zero in the limit, but precision might not be so clear-cut. A really difficult problem where your top prediction is a false positive would, I believe, have precision go to zero in the limit. For plotting with this function, that case is probably not a big deal: a vertical line from 0 to 1 would be hidden by the y-axis.
But for tied top predictions, as seen above, the plot becomes misleading and makes the viewer think that at least their top prediction was a true positive. This may seem like a corner case, but I ran into it on a tough classification problem I was working on and was a little baffled by the output until I checked out the code.
A quick fix that preserves the intention of a clean P-R plot might be to draw a horizontal line from the highest threshold to the y-axis; this should not alter the output in most cases. Trying to actually calculate where the curve is going in the limit might be a bit overkill :-)
I would also think that adding a 1 to the end of the thresholds vector would be helpful when plotting both precision and recall against the probabilities. I'm happy to open a PR if anyone thinks this is worth addressing.
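For instance, since precision and recall each have one more entry than thresholds, plotting them against the score threshold currently needs something like the following sketch (using the arrays from the example above):

```python
import matplotlib.pyplot as plt

# precision and recall have one more entry than thresholds (the appended
# recall=0 / precision=1 pair), so drop it (or, as suggested, pad thresholds with a 1)
plt.plot(thresholds, precision[:-1], label='precision')
plt.plot(thresholds, recall[:-1], label='recall')
plt.xlabel('decision threshold')
plt.legend()
plt.show()
```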