
Revisit the "chance level" for the different displays #30352


Open

glemaitre opened this issue Nov 26, 2024 · 2 comments

@glemaitre
Member

@e-pet made some interesting observations in different PRs and issues. I take the opportunity to consolidate some of those comments here.

First, we use the term "chance", which is ambiguous depending on the display; the term "baseline" would probably be better. In addition, I checked and I think we should make an extra effort on the definition of the baseline for each type of plot: for the ROC curve, the baseline is "a random classifier assigning the positive class with probability p and the negative class with probability 1 − p" [1], while for the PR curve, the baseline is derived from the "always-positive classifier", where any recall or precision under the prevalence π should be discarded [1].
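For reference, a minimal sketch of the current behavior under discussion, assuming scikit-learn ≥ 1.3 where both displays expose `plot_chance_level` (the dataset is an arbitrary illustration):

```python
# Minimal sketch of the current "chance level" lines; assumes
# scikit-learn >= 1.3, where `plot_chance_level` exists.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

# Diagonal line: the random classifier assigning the positive class
# with probability p.
RocCurveDisplay.from_estimator(clf, X_test, y_test, plot_chance_level=True)

# Horizontal line at precision == prevalence, currently drawn for all recalls.
PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, plot_chance_level=True)
```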

This leads to a second issue: in the PR curve, we plot the horizontal line derived from the always-positive classifier, but we do not discard the region where recall < π. In this case, as mentioned by @e-pet, it might make sense to show the hyperbolic line of the always-positive classifier instead (cf. Fig. 2 in [1]).
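A rough sketch of that hyperbola, under my reading of [1]: the always-positive classifier has recall 1 and precision π, hence $F_1 = 2\pi / (1 + \pi)$, and the baseline collects every (recall, precision) pair achieving that same $F_1$ (the prevalence value below is arbitrary):

```python
# Sketch of the hyperbolic always-positive baseline (my reading of [1]);
# the prevalence `pi` is an arbitrary illustration.
import matplotlib.pyplot as plt
import numpy as np

pi = 0.1  # prevalence of the positive class
f1 = 2 * pi / (1 + pi)  # F1 score of the always-positive classifier

# Solving 2 * p * r / (p + r) == f1 for p gives p = f1 * r / (2 * r - f1).
# The curve runs from (recall=pi, precision=1) down to (recall=1, precision=pi).
recall = np.linspace(pi, 1, 200)
precision = f1 * recall / (2 * recall - f1)

plt.plot(recall, precision, "k--", label="always-positive $F_1$ baseline")
plt.axhline(pi, color="grey", linestyle=":", label="current horizontal baseline")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```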

@e-pet feel free to add any other points that you wanted to discuss. Here, I wanted to focus on the points that look critical and could be addressed.

[1] Flach, P., & Kull, M. (2015). Precision-Recall-Gain Curves: PR Analysis Done Right. Advances in Neural Information Processing Systems, 28.

@e-pet
e-pet commented Dec 1, 2024

Hi @glemaitre, thank you for collecting this here!

I did some further reading in the meantime (and fixed at least one misunderstanding of mine), so let me try to summarize the issue(s).

  • My dismissal of the horizontal line derived from the always-positive classifier was wrong. One can construct it this way: have a "model" return completely random scores, and then tune the decision threshold between 0 and 1. This yields exactly that horizontal line (see the sketch after this list). So: the horizontal line is valid (i.e., associated with actual baseline models that one can easily construct at every point along the line), and it is indeed based on a 'chance' baseline. (Might be worth pointing this out somewhere in the documentation, if that is not already the case?)
  • Separately from this, Flach and Kull argue that in terms of the area under the PR curve, the baseline to beat is a curve whose $F_1$ score equals that of the always-positive classifier at every point along the curve. The interpretation is simple: if your model is below this line, it performs worse than a trivial always-positive classifier in terms of $F_1$. This is where the hyperbolic baseline comes from. I am actually not sure what the equivalent curve in the ROC case would be, or whether it differs from the standard diagonal. Note that, at least to my current understanding, this baseline is 'virtual' in the sense that it is not trivially possible to actually construct models with the precision/recall values specified by every point along the curve.
  • To complicate matters further, there is actually a third 'baseline-like' line that one could show: the boundary of the unreachable region in PR space, described here (and also very easy to calculate; see the sketch after this list). Points below this line cannot be reached by any classifier, with important consequences for the validity of naive AUCPR comparisons, as spelled out both in that paper and in the Flach/Kull paper.
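A quick numeric sketch of the first and third points above; the sample size and prevalence are arbitrary, and the minimum-precision formula is my reading of the unreachable-region result (at recall r, the worst case predicts all negatives positive):

```python
# Sketch of bullets 1 and 3; sample size and prevalence are arbitrary.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
n, pi = 100_000, 0.1
y = (rng.random(n) < pi).astype(int)  # labels with prevalence ~pi
scores = rng.random(n)                # completely random scores

# Bullet 1: sweeping the threshold over random scores traces (approximately)
# the horizontal line precision == pi at every recall.
precision, recall, _ = precision_recall_curve(y, scores)
print(precision.mean())  # ~0.1

# Bullet 3: minimum precision reachable at recall r (all negatives predicted
# positive), i.e. the lower boundary of the reachable PR region:
# p_min(r) = pi * r / (pi * r + 1 - pi).
r = np.linspace(0.1, 1, 4)
print(pi * r / (pi * r + 1 - pi))
```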

@ogrisel
Member
ogrisel commented Feb 10, 2025

For the ROC curve, what we currently call "chance level" (the diagonal ROC curve) corresponds to any non-informative baseline/predictor: a predictor whose predictions do not depend on X. These can be constant predict_proba predictions (0, 1, the fraction of the majority class in the training set, or any other arbitrary constant). Even a classifier that outputs a random predicted probability at each test point would have a ROC curve lying on the diagonal in the limit of a large prediction set.
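A quick check of this claim, with an arbitrary synthetic label vector:

```python
# Quick check: constant and random predictions both sit on the diagonal
# (ROC AUC ~= 0.5); the labels are an arbitrary illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = (rng.random(50_000) < 0.3).astype(int)

print(roc_auc_score(y, np.full(y.shape, 0.5)))  # exactly 0.5
print(roc_auc_score(y, rng.random(y.shape)))    # ~0.5 on a large sample
```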

Personally, I don't like the "chance level" naming because it is a bit fuzzy, and it is not intuitive how a classifier that constantly predicts 1 or 0 relates to "chance". I would rather name this "non-informative predictor" or "non-informative baseline" (or even "constant predictor").
