ENH Add Multiclass Brier Score Loss by ogrisel · Pull Request #22046 · scikit-learn/scikit-learn · GitHub
ENH Add Multiclass Brier Score Loss #22046


Merged

71 commits merged on Mar 20, 2025

Commits (71 commits; changes from all commits shown below)
f630718
add multi-class support
Oct 26, 2020
e08d4f4
fix swapped y_true y_prob
Oct 26, 2020
eff8854
fix docstring
Oct 26, 2020
d864395
fix docstring
Oct 26, 2020
32ab60a
fix variable name spelling
Oct 26, 2020
6e73c0d
add tests
Oct 26, 2020
7ce3f85
merge upstream
Oct 26, 2020
9cd4247
import re
Oct 26, 2020
1369945
fix docstring
Oct 28, 2020
a183d06
fix linting
Oct 28, 2020
08688d3
fix linting
Oct 28, 2020
4f8a5f2
remove unused import
Oct 28, 2020
7b51433
add multiclass_brier_score_loss
Nov 2, 2020
d5c90bf
add tests
Nov 2, 2020
2243828
fix docstring
Nov 2, 2020
9893101
Merge remote-tracking branch 'upstream/master' into multiclass_brier_…
Nov 2, 2020
3e4465f
use f-strings
Nov 2, 2020
eafda42
fix tests
Nov 2, 2020
038abf7
fix error message
Nov 2, 2020
838f827
fix docstring
Nov 2, 2020
5ef41c7
fix linting
Nov 2, 2020
4fb4c4f
Update sklearn/metrics/_classification.py
aggvarun01 Nov 4, 2020
3260bf3
Apply suggestions from code review
aggvarun01 Nov 4, 2020
86d793e
split tests
Nov 5, 2020
411ec1a
add private function
Nov 6, 2020
f84493c
add warning for labels
Nov 11, 2020
79f014d
Merge remote-tracking branch 'origin/main' into multiclass_brier_scor…
ogrisel Dec 21, 2021
50f50ef
Fix multiclass_brier_score_loss docstring sections order
ogrisel Dec 21, 2021
884c434
Add entry in the changelog
ogrisel Dec 21, 2021
cdc4cc9
Update multiclass calibration example
ogrisel Dec 21, 2021
9bd4ad1
Register a new scorer
ogrisel Dec 21, 2021
bae82ee
Merge remote-tracking branch 'origin/main' into multiclass_brier_scor…
ogrisel Dec 31, 2021
28c313f
Merge remote-tracking branch 'upstream/main' into multiclass_brier_sc…
antoinebaker Nov 6, 2024
478c568
fix doctest and matched errors
antoinebaker Nov 6, 2024
2ee8fd7
changelog
antoinebaker Nov 6, 2024
b6e8344
fix changelog
antoinebaker Nov 6, 2024
da9e8c6
Merge remote-tracking branch 'upstream/main' into multiclass_brier_sc…
antoinebaker Dec 2, 2024
b60cb3b
add normalization keyword
antoinebaker Dec 4, 2024
2754db7
Merge remote-tracking branch 'upstream/main' into multiclass_brier_sc…
antoinebaker Dec 4, 2024
dcde0d4
document normalize
antoinebaker Dec 4, 2024
a0baefb
add scale_by_half
antoinebaker Dec 11, 2024
3311ca8
update doc
antoinebaker Dec 11, 2024
bfa89dd
changelog
antoinebaker Dec 11, 2024
4f00c63
Merge branch 'main' into multiclass_brier_score_loss
antoinebaker Dec 11, 2024
1a88de7
improve test coverage
antoinebaker Dec 13, 2024
42c567c
Apply suggestions from code review
antoinebaker Dec 31, 2024
eb446b8
rewrap
antoinebaker Dec 31, 2024
5c2cf3d
changelog
antoinebaker Dec 31, 2024
a764d1f
add labels
antoinebaker Dec 31, 2024
63548bf
Update sklearn/metrics/tests/test_classification.py
antoinebaker Dec 31, 2024
b2d2ba9
formatting
antoinebaker Dec 31, 2024
46dec65
Update sklearn/metrics/tests/test_common.py
antoinebaker Jan 3, 2025
6fa3b5d
Apply suggestions from code review
antoinebaker Jan 3, 2025
c91aef1
versionadded
antoinebaker Jan 3, 2025
b89d077
more explicit warning
antoinebaker Jan 3, 2025
c05e891
incomplete labels
antoinebaker Jan 3, 2025
a65b9c3
Merge remote-tracking branch 'upstream/main' into multiclass_brier_sc…
antoinebaker Jan 6, 2025
1054b83
remove log_loss mention
antoinebaker Jan 6, 2025
c163b6a
fix doctest
antoinebaker Jan 6, 2025
01fa561
update test_common
antoinebaker Jan 8, 2025
e45a660
Merge remote-tracking branch 'upstream/main' into multiclass_brier_sc…
antoinebaker Feb 6, 2025
4e27bbf
return float
antoinebaker Feb 6, 2025
a5b448b
changelog
antoinebaker Feb 6, 2025
ef0bbe8
Merge branch 'main' into multiclass_brier_score_loss
antoinebaker Feb 6, 2025
3595b8e
Apply suggestions from code review
antoinebaker Mar 17, 2025
58e5f18
doc
antoinebaker Mar 19, 2025
6de5e13
symmetry tests
antoinebaker Mar 19, 2025
653d4ae
test y_proba with two columns
antoinebaker Mar 19, 2025
242ee3e
Merge branch 'main' into multiclass_brier_score_loss
antoinebaker Mar 19, 2025
6359c7d
Merge branch 'main' into multiclass_brier_score_loss
antoinebaker Mar 20, 2025
e3e406c
Apply suggestions from code review
antoinebaker Mar 20, 2025
67 changes: 45 additions & 22 deletions doc/modules/model_evaluation.rst
[...]

probability outputs (``predict_proba``) of a classifier instead of its
discrete predictions.

For binary classification with a true label :math:`y \in \{0,1\}`
and a probability estimate :math:`\hat{p} \approx \operatorname{Pr}(y = 1)`,
the log loss per sample is the negative log-likelihood
of the classifier given the true label:

.. math::

L_{\log}(y, \hat{p}) = -\log \operatorname{Pr}(y|\hat{p}) = -(y \log (\hat{p}) + (1 - y) \log (1 - \hat{p}))

This extends to the multiclass case as follows.
Let the true labels for a set of samples
be encoded as a 1-of-K binary indicator matrix :math:`Y`,
i.e., :math:`y_{i,k} = 1` if sample :math:`i` has label :math:`k`
taken from a set of :math:`K` labels.
Let :math:`\hat{P}` be a matrix of probability estimates,
with elements :math:`\hat{p}_{i,k} \approx \operatorname{Pr}(y_{i,k} = 1)`.
Then the log loss of the whole set is

.. math::

L_{\log}(Y, \hat{P}) = -\log \operatorname{Pr}(Y|\hat{P}) = - \frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log \hat{p}_{i,k}

To see how this generalizes the binary log loss given above,
note that in the binary case,
:math:`\hat{p}_{i,0} = 1 - \hat{p}_{i,1}` and :math:`y_{i,0} = 1 - y_{i,1}`,
so expanding the inner sum over :math:`y_{i,k} \in \{0,1\}`
gives the binary log loss.
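
As a concrete check, the formula can be evaluated directly with NumPy and
compared against :func:`log_loss` (an illustrative sketch, not part of this
diff; the array values are made up)::

    import numpy as np
    from sklearn.metrics import log_loss
    from sklearn.preprocessing import LabelBinarizer

    y_true = np.array([0, 2, 1, 2])      # N = 4 samples, K = 3 classes
    P_hat = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.2, 0.6, 0.2],
                      [0.2, 0.2, 0.6]])  # each row sums to one

    # Build the 1-of-K indicator matrix Y and apply the formula above.
    Y = LabelBinarizer().fit_transform(y_true)
    manual = -(Y * np.log(P_hat)).sum(axis=1).mean()
    assert np.isclose(manual, log_loss(y_true, P_hat))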

[...]
Brier score loss
----------------

The :func:`brier_score_loss` function computes the `Brier score
<https://en.wikipedia.org/wiki/Brier_score>`_ for binary and multiclass
probabilistic predictions and is equivalent to the mean squared error.
Quoting Wikipedia:

"The Brier score is a proper score function that measures the accuracy of
probabilistic predictions. It is applicable to tasks in which predictions
must assign probabilities to a set of mutually exclusive discrete outcomes."
"The Brier score is a strictly proper scoring rule that measures the accuracy of
probabilistic predictions. [...] [It] is applicable to tasks in which predictions
must assign probabilities to a set of mutually exclusive discrete outcomes or
classes."

Let the true labels for a set of :math:`N` data points be encoded as a 1-of-K binary
indicator matrix :math:`Y`, i.e., :math:`y_{i,k} = 1` if sample :math:`i` has
label :math:`k` taken from a set of :math:`K` labels. Let :math:`\hat{P}` be a matrix
of probability estimates with elements :math:`\hat{p}_{i,k} \approx \operatorname{Pr}(y_{i,k} = 1)`.
Following the original definition by [Brier1950]_, the Brier score is given by:

.. math::

BS(Y, \hat{P}) = \frac{1}{N}\sum_{i=0}^{N-1}\sum_{k=0}^{K-1}(y_{i,k} - \hat{p}_{i,k})^{2}

The Brier score lies in the interval :math:`[0, 2]`, and the lower the value,
the better the probability estimates are (the mean squared difference is
smaller). In fact, the Brier score is a strictly proper scoring rule, meaning
that it achieves the best score only when the estimated probabilities equal
the true ones.

Note that in the binary case, the Brier score is usually divided by two and
then lies in :math:`[0,1]`. For binary targets :math:`y_i \in \{0, 1\}` and
probability estimates :math:`\hat{p}_i \approx \operatorname{Pr}(y_i = 1)`
for the positive class, the Brier score is then equal to:

.. math::

BS(y, \hat{p}) = \frac{1}{N} \sum_{i=0}^{N - 1}(y_i - \hat{p}_i)^2

The :func:`brier_score_loss` function computes the Brier score given the
ground-truth labels and predicted probabilities, as returned by an estimator's
``predict_proba`` method. The `scale_by_half` parameter controls which of the
two definitions above is used.
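
The relation between the two conventions can be verified with a short sketch
(illustrative only; it assumes the default ``scale_by_half="auto"`` keeps the
halved convention for 1-D binary inputs, as described above)::

    import numpy as np
    from sklearn.metrics import brier_score_loss

    y_true = np.array([0, 1, 1, 0])
    p_pos = np.array([0.1, 0.9, 0.8, 0.4])  # estimated Pr(y = 1)

    # Binary convention: mean squared error on the positive class, range [0, 1].
    bs_binary = np.mean((y_true - p_pos) ** 2)

    # Original convention: squared errors summed over both columns of the
    # indicator and probability matrices, range [0, 2].
    Y = np.column_stack([1 - y_true, y_true])
    P = np.column_stack([1 - p_pos, p_pos])
    bs_multi = np.mean(np.sum((Y - P) ** 2, axis=1))

    assert np.isclose(bs_multi, 2 * bs_binary)
    assert np.isclose(brier_score_loss(y_true, p_pos), bs_binary)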

Here is a small example of usage of this function::

>>> import numpy as np
>>> from sklearn.metrics import brier_score_loss
>>> y_true = np.array([0, 1, 1, 0])
>>> y_true_categorical = np.array(["spam", "ham", "ham", "spam"])
>>> y_prob = np.array([0.1, 0.9, 0.8, 0.4])
>>> brier_score_loss(y_true, y_prob)
0.055
>>> brier_score_loss(y_true, 1 - y_prob, pos_label=0)
0.055
>>> brier_score_loss(y_true_categorical, y_prob, pos_label="ham")
0.055
>>> brier_score_loss(
... ["eggs", "ham", "spam"],
... [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.2, 0.2, 0.6]],
... labels=["eggs", "ham", "spam"],
... )
0.146...
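
The multiclass value above can be reproduced directly from the definition
(a quick illustrative check, not part of the documentation)::

    import numpy as np

    # One-hot rows for ["eggs", "ham", "spam"] and the probabilities above.
    Y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
    P = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.2, 0.2, 0.6]])
    print(np.mean(np.sum((Y - P) ** 2, axis=1)))  # 0.1466...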

The Brier score can be used to assess how well a classifier is calibrated.
However, a lower Brier score loss does not always mean a better calibration.
6 changes: 6 additions & 0 deletions (new changelog file; path not shown)
- :func:`metrics.brier_score_loss` implements the Brier score for multiclass
classification problems and adds a `scale_by_half` argument. This metric is
notably useful to assess both sharpness and calibration of probabilistic
classifiers. See the docstrings for more details. By
:user:`Varun Aggarwal <aggvarun01>`, :user:`Olivier Grisel <ogrisel>` and
:user:`Antoine Baker <antoinebaker>`.
3 changes: 3 additions & 0 deletions doc/whats_new/upcoming_changes/sklearn.metrics/22046.fix.rst
- :func:`metrics.log_loss` now raises a `ValueError` if values of `y_true`
are missing in `labels`. By :user:`Varun Aggarwal <aggvarun01>`,
:user:`Olivier Grisel <ogrisel>` and :user:`Antoine Baker <antoinebaker>`.
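
A quick sketch of the stricter behavior (illustrative only; the exact error
message is not reproduced here)::

    from sklearn.metrics import log_loss

    try:
        # "eggs" appears in y_true but is missing from labels.
        log_loss(["spam", "ham", "eggs"],
                 [[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]],
                 labels=["ham", "spam"])
    except ValueError as exc:
        print(exc)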
38 changes: 33 additions & 5 deletions examples/calibration/plot_calibration_multiclass.py
[...]

from sklearn.metrics import log_loss

loss = log_loss(y_test, clf_probs)
cal_loss = log_loss(y_test, cal_clf_probs)

print("Log-loss of")
print(f" * uncalibrated classifier: {score:.3f}")
print(f" * calibrated classifier: {cal_score:.3f}")
print("Log-loss of:")
print(f" - uncalibrated classifier: {loss:.3f}")
print(f" - calibrated classifier: {cal_loss:.3f}")

# %%
# We can also assess calibration with the Brier score for probabilistic predictions
# (lower is better, possible range is [0, 2]):

from sklearn.metrics import brier_score_loss

loss = brier_score_loss(y_test, clf_probs)
cal_loss = brier_score_loss(y_test, cal_clf_probs)

print("Brier score of")
print(f" - uncalibrated classifier: {loss:.3f}")
print(f" - calibrated classifier: {cal_loss:.3f}")

# %%
# According to the Brier score, the calibrated classifier is not better than
# the original model.
#
# Finally we generate a grid of possible uncalibrated probabilities over
# the 2-simplex, compute the corresponding calibrated probabilities and
# plot arrows for each. The arrows are colored according to the highest
[...]
plt.ylim(-0.05, 1.05)

plt.show()

# %%
# One can observe that, on average, the calibrator is pushing highly confident
# predictions away from the boundaries of the simplex while simultaneously
# moving uncertain predictions towards one of three modes, one for each class.
# We can also observe that the mapping is not symmetric. Furthermore, some
# arrows seem to cross class assignment boundaries, which is not necessarily
# what one would expect from a calibration map, as it means that some
# predicted classes will change after calibration.
#
# All in all, the One-vs-Rest multiclass-calibration strategy implemented in
# `CalibratedClassifierCV` should not be trusted blindly.
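
# %%
# The fraction of samples whose predicted class changes after calibration can
# be checked directly (a quick illustrative check, not part of the original
# example):

changed = clf_probs.argmax(axis=1) != cal_clf_probs.argmax(axis=1)
print(f"Fraction of changed class assignments: {changed.mean():.3f}")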