[MRG] Multi-class Brier Score Loss #18699
Conversation
I think this is the best course of action. Introduce a new … Also when trying to call … If the user decides to use …
Should be ready for a review.
Thanks for bearing with me, here is another round of review comments:
sklearn/metrics/_classification.py
the probabilities provided are assumed to be that of the
positive class. The labels in ``y_pred`` are assumed to be
ordered alphabetically, as done by
:class:`preprocessing.LabelBinarizer`.
alphabetically => lexicographically.
I find it misleading that we do not respect the order implied by labels when passed. Maybe we should raise a ValueError or a warning when the user passes labels= that does not respect the lexicographical order.
Note that if we decide to do something else w.r.t. the handling of the y_prob class order, we would have to update the log_loss metric accordingly.
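For illustration (a standalone snippet, not part of this PR), LabelBinarizer always sorts the classes it sees, regardless of the order in which they first appear:

```python
from sklearn.preprocessing import LabelBinarizer

# classes_ ends up lexicographically sorted, not in first-seen order
lb = LabelBinarizer().fit(["spam", "ham", "eggs"])
print(lb.classes_)  # ['eggs' 'ham' 'spam']
```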
As you correctly identified, this implementation of multiclass_brier_score_loss shadows the one from log_loss. Any change we make will have to be made to both functions.
However, both functions use LabelBinarizer to infer labels, so it seems like any warning/error that we raise should be raised there. Right?
On second thought, this is a misuse of LabelBinarizer.
We have the following options:
- Fix multiclass_brier_score so that it respects labels
- Keep multiclass_brier_score as is, but raise a warning/error

If we go with the latter, then we should do the same with log_loss as well. Or should that be a different PR as well?
labels : array-like, default=None
    If not provided, labels will be inferred from y_true. If ``labels``
    is ``None`` and ``y_prob`` has shape (n_samples,) the labels are
    assumed to be binary and are inferred from ``y_true``.
We need to pass an explicit pos_label if y_true has object or string values, similar to fbeta_score I believe. @glemaitre let me know if I am missing something.
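For context, a quick illustrative snippet (not from this PR) of how brier_score_loss already handles string labels via pos_label:

```python
from sklearn.metrics import brier_score_loss

y_true = ["spam", "ham", "ham", "spam"]
y_prob = [0.9, 0.1, 0.2, 0.8]  # estimated probability of "spam"

# With string labels the positive class is ambiguous, so pos_label is needed.
print(brier_score_loss(y_true, y_prob, pos_label="spam"))  # 0.025
```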
But then log_loss has the same problem... maybe for a subsequent PR then.
sklearn/metrics/_classification.py
else:
    raise ValueError(f'The number of classes in labels is different '
                     f'from that in y_prob. Classes found in '
                     f'labels: {lb.classes_}')
If I am not mistaken, all this input validation code is duplicated from the log_loss function. Could you please factorize it into a private helper method:
y_true, y_prob, labels = _validate_multiclass_probabilistic_prediction(y_true, y_prob, labels)
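A minimal sketch of what such a helper might look like, mirroring the existing validation in log_loss (names and details are illustrative, not a definitive implementation):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

def _validate_multiclass_probabilistic_prediction(y_true, y_prob, labels=None):
    # Binarize y_true against the given (or inferred) labels.
    lb = LabelBinarizer()
    lb.fit(labels if labels is not None else y_true)
    transformed_labels = lb.transform(y_true)

    # Expand the binary case to two explicit columns.
    if transformed_labels.shape[1] == 1:
        transformed_labels = np.append(
            1 - transformed_labels, transformed_labels, axis=1
        )

    y_prob = np.asarray(y_prob)
    if y_prob.ndim == 1:
        y_prob = np.column_stack([1 - y_prob, y_prob])

    # The probability columns must line up with the binarized classes.
    if y_prob.shape[1] != transformed_labels.shape[1]:
        raise ValueError(
            f"The number of classes in labels is different from that in "
            f"y_prob. Classes found in labels: {lb.classes_}"
        )
    return transformed_labels, y_prob, lb.classes_
```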
Not sure about the convention here. Should I put this function in sklearn.metrics._base or in sklearn.metrics._classification?
[[0.2, 0.7, 0.1],
 [0.6, 0.2, 0.2],
 [0.6, 0.1, 0.3]]),
.41333333)
Could you please split this test into two tests: one for invalid input and error messages, and the other for expected numerical values of valid inputs?
Also, could you please add a test that checks that perfect predictions lead to a Brier score of 0 and perfectly wrong predictions lead to a Brier score of 2 (both for a 2-class and a 3-class classification problem)?
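A possible shape for the second test, assuming the multiclass_brier_score_loss proposed in this PR (a sketch, not the final test):

```python
import numpy as np
# multiclass_brier_score_loss is the function proposed in this PR.

def test_brier_score_extremes():
    # 3-class problem: one-hot rows put probability 1 on the true class.
    y_true = np.array([0, 1, 2])
    perfect = np.eye(3)[y_true]
    # Rolling the columns puts probability 1 on a wrong class in every row.
    worst = np.roll(perfect, 1, axis=1)
    assert multiclass_brier_score_loss(y_true, perfect) == 0
    assert multiclass_brier_score_loss(y_true, worst) == 2
```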
👍
case as well. When used for the binary case, `multiclass_brier_score_loss`
returns a Brier score that is exactly twice the value returned by this
function.
Please recall the LaTeX equation in the docstring here.
I copied the equation for multiclass_brier_score_loss as asked. But the docstring for brier_score_loss did not include the alternate (i.e. original) formula, so I added that equation to the docstring as well.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Apologies for the late reply. I implemented the changes you suggested. However, there are still a few things that need review/approval:
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0])

def add_dimension(y_prob):
    # Turn a 1-D vector of positive-class probabilities into an
    # (n_samples, 2) array of per-class probabilities.
    y_prob = y_prob[:, np.newaxis]
    y_prob = np.append(1 - y_prob, y_prob, axis=1)
    return y_prob

# This is what the test is doing, which gives different results
print(log_loss(y_pred, y_true))  # 15.542609297195847
print(log_loss(y_true, y_pred))  # 15.542649277067358

# Alternate way to run the same test, which gives the same results
print(log_loss(y_true, add_dimension(y_pred)))  # 15.542449377709806
print(log_loss(y_pred, add_dimension(y_true)))  # 15.542449377709806

I have removed …
Reference Issues/PRs
Resolves #16055
What does this implement/fix?
The original formulation of the Brier score inherently supports multiclass classification (source). This support is currently absent in scikit-learn, which restricts the Brier score to binary classification. This PR implements the Brier score for multi-class classification.
Notes/Open Questions
There are two different definitions of the Brier score. The one implemented in scikit-learn, which is only applicable to the binary case, is:

$$BS = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

(where $\hat{y}_i$ is the predicted probability of the positive class)

Whereas the original, more general definition, applicable to both the binary and the multi-class case, is:

$$BS = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} (\hat{y}_{ik} - y_{ik})^2$$
The range of values for the former is [0, 1], whereas for the latter it is [0, 2]. For backwards compatibility, this PR uses the old definition for the binary case and the new one for the multi-class case. However, this implementation seems unintuitive, since the range of values changes between the two cases. I can see the following workarounds:
While the current PR implements method 1, I personally lean towards method 3.
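As a quick standalone check of the factor-of-two relationship between the two definitions in the binary case (plain NumPy, independent of this PR):

```python
import numpy as np

y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.9, 0.8, 0.3])  # P(class 1)

# Current scikit-learn definition: mean squared error on the positive class.
binary_bs = np.mean((y_prob - y_true) ** 2)

# Original definition: squared errors summed over both class columns.
probs = np.column_stack([1 - y_prob, y_prob])
onehot = np.column_stack([1 - y_true, y_true])
multiclass_bs = np.mean(np.sum((probs - onehot) ** 2, axis=1))

print(np.isclose(multiclass_bs, 2 * binary_bs))  # True
```

This is why the binary range is [0, 1] while the multi-class range is [0, 2].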