Improve documentation: consistent scoring functions #10584
Comments
Interesting. I'll need to read up more on these topics, but perhaps you'd
like to propose something more specific. One option is a "which metric
should I use?" section of the user guide, another is advantages and
disadvantages sections under each metric as we have for classification. The
advantage of the former is that it is straightforward to use, and makes
clear what questions are involved in the decision etc. But it might be
harder to maintain.
|
Without yet understanding any of the theory you refer to, we may need to be
clear on "the mean of what" when we say "strictly consistent for the mean".
In cross validation we care about the mean score over validation sets, not
over samples in each validation set. A measure like recall (or balanced
accuracy or AUROC) effectively weights positive and negative samples
differently (note rand accuracy = prevalence-weighted average recall over
classes), so over samples a measure like balanced accuracy explicitly
adopts a weighted mean. Is using a weighted mean over samples a problem? (A
problem with precision is that the weight is a function of the prediction under this interpretation.)
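As a small numeric sketch of the relation mentioned above (assuming an imbalanced binary toy problem and a scikit-learn version that provides balanced_accuracy_score; the data is purely illustrative): accuracy equals the prevalence-weighted mean of per-class recall, while balanced accuracy is the unweighted (macro) mean of per-class recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Imbalanced toy data: 90 negatives, 10 positives (illustrative assumption).
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 85 + [1] * 5 + [0] * 8 + [1] * 2)

recall_per_class = recall_score(y_true, y_pred, average=None)  # [recall_0, recall_1]
prevalence = np.bincount(y_true) / len(y_true)                 # class frequencies

# Accuracy = prevalence-weighted mean of per-class recall.
print(np.dot(prevalence, recall_per_class), accuracy_score(y_true, y_pred))
# Balanced accuracy = unweighted (macro) mean of per-class recall.
print(recall_per_class.mean(), balanced_accuracy_score(y_true, y_pred))
```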
|
@jnothman I updated the above text slightly in order to make clear that it is always the mean of the target variable. I would go for the first option and write a section "Which metric should I use?". I can make a proposal via a PR, but it may take some time.
Solved by #11430
Abstract
Improve the documentation of section 3.3. Model Evaluation. Give advice to use strictly proper scoring functions.
Explanation
The documentation of scikit-learn is amazing and a strong argument for using it (and for its fame). Still, I think one could get a bit lost when choosing the right scoring function or metric out of the many options for model evaluation. There are some influential papers which advocate the usage of strictly proper scoring functions:
1. Making and Evaluating Point Forecasts (Gneiting, 2011)
2. Strictly Proper Scoring Rules, Prediction, and Estimation (Gneiting & Raftery, 2007)
For classification and regression, most of the time one is interested in the mean functional of the distribution of the target variable y. Then, the scoring function used to compare the predictive power of different models (like when choosing the regularization strength via cross validation) should be strictly consistent for the mean functional.
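A minimal sketch of what this could look like in practice, assuming log loss as the strictly consistent scoring function for tuning the regularization strength; the estimator, data and parameter grid are illustrative assumptions, not something prescribed by this issue:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Log loss is strictly consistent for the mean of y (the class probability),
# so it rewards well-calibrated probability estimates during model selection.
grid = GridSearchCV(
    LogisticRegression(solver="lbfgs", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_log_loss",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```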
Examples
For binary classification, knowing the mean of y is equivalent to knowing the whole distribution of y. Brier (squared error) and logistic (log) loss are strictly consistent for the mean. Accuracy and 0-1 loss are only consistent, but not strictly consistent (they are strictly consistent for the mode, which is less informative than the mean for classification). ROC, precision, recall etc. are not consistent for the mean, afaik.
For regression, Bregman functions (eq. (18) of 1.) are strictly consistent for the mean. For targets y on the whole real line, the squared error is one example. For positive-valued targets y, the squared error (b=2), Gamma deviance (b=0) and Poisson deviance (b=1) are examples, see eq. (20) of 1.
Maybe one should check whether the many metrics already provided by scikit-learn are (strictly) consistent for a certain functional and whether they are equivalent to one another.
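A hedged sketch of the regression examples above, comparing metrics that are strictly consistent for the mean; mean_poisson_deviance and mean_gamma_deviance are assumed to be available (a sufficiently recent scikit-learn), and the simulated data is illustrative:

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error,
    mean_poisson_deviance,
    mean_gamma_deviance,
)

rng = np.random.RandomState(0)
y_true = rng.gamma(shape=2.0, scale=3.0, size=200)   # positive-valued target
y_pred = y_true * rng.uniform(0.8, 1.2, size=200)    # slightly perturbed forecast

# All three are Bregman-type scoring functions, strictly consistent for the mean.
print(mean_squared_error(y_true, y_pred))     # b = 2 (squared error)
print(mean_poisson_deviance(y_true, y_pred))  # b = 1 (Poisson deviance)
print(mean_gamma_deviance(y_true, y_pred))    # b = 0 (Gamma deviance)
```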
Disclaimer
I do not know the authors of the cited papers and I'm not pushing my own research, just my opinion on how to improve this fantastic library 😏