probability outputs (``predict_proba``) of a classifier instead of its
discrete predictions.

For binary classification with a true label :math:`y \in \{0,1\}`
and a probability estimate :math:`\hat{p} \approx \operatorname{Pr}(y = 1)`,
the log loss per sample is the negative log-likelihood
of the classifier given the true label:

.. math::

    L_{\log}(y, \hat{p}) = -\log \operatorname{Pr}(y|\hat{p}) = -(y \log (\hat{p}) + (1 - y) \log (1 - \hat{p}))

This extends to the multiclass case as follows.
Let the true labels for a set of samples
be encoded as a 1-of-K binary indicator matrix :math:`Y`,
i.e., :math:`y_{i,k} = 1` if sample :math:`i` has label :math:`k`
taken from a set of :math:`K` labels.
Let :math:`\hat{P}` be a matrix of probability estimates,
with elements :math:`\hat{p}_{i,k} \approx \operatorname{Pr}(y_{i,k} = 1)`.
Then the log loss of the whole set is

.. math::

    L_{\log}(Y, \hat{P}) = -\log \operatorname{Pr}(Y|\hat{P}) = - \frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log \hat{p}_{i,k}

To see how this generalizes the binary log loss given above,
note that in the binary case,
:math:`\hat{p}_{i,0} = 1 - \hat{p}_{i,1}` and :math:`y_{i,0} = 1 - y_{i,1}`,
so expanding the inner sum over :math:`y_{i,k} \in \{0,1\}`
gives the binary log loss.
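As a quick sanity check of the binary formula above, the per-sample negative
log-likelihood can be averaged directly with NumPy and compared against
:func:`log_loss` from ``sklearn.metrics``; the label and probability values
below are illustrative, not taken from the text:

```python
import numpy as np
from sklearn.metrics import log_loss

# Illustrative binary labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# Average of -(y log(p) + (1 - y) log(1 - p)), per the binary formula above
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(round(manual, 4))                              # ~0.4723
print(np.isclose(manual, log_loss(y_true, y_prob)))  # True
```

Passing a 1d probability array to ``log_loss`` treats it as the probability of
the positive class, matching the binary formula.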
set [0,1] has an error::

Brier score loss
----------------

The :func:`brier_score_loss` function computes the `Brier score
<https://en.wikipedia.org/wiki/Brier_score>`_ for binary and multiclass
probabilistic predictions and is equivalent to the mean squared error.
Quoting Wikipedia:

  "The Brier score is a strictly proper scoring rule that measures the accuracy of
  probabilistic predictions. [...] [It] is applicable to tasks in which predictions
  must assign probabilities to a set of mutually exclusive discrete outcomes or
  classes."

Let the true labels for a set of :math:`N` data points be encoded as a 1-of-K binary
indicator matrix :math:`Y`, i.e., :math:`y_{i,k} = 1` if sample :math:`i` has
label :math:`k` taken from a set of :math:`K` labels. Let :math:`\hat{P}` be a matrix
of probability estimates with elements :math:`\hat{p}_{i,k} \approx \operatorname{Pr}(y_{i,k} = 1)`.
Following the original definition by [Brier1950]_, the Brier score is given by:

.. math::

    BS(Y, \hat{P}) = \frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} (y_{i,k} - \hat{p}_{i,k})^2

The Brier score lies in the interval :math:`[0, 2]` and the lower the value, the
better the probability estimates are (the mean squared difference is smaller).
In fact, the Brier score is a strictly proper scoring rule, meaning that it
achieves the best score only when the estimated probabilities equal the
true ones.

Note that in the binary case, the Brier score is usually divided by two and
then ranges over :math:`[0, 1]`. For binary targets :math:`y_i \in \{0, 1\}` and
probability estimates :math:`\hat{p}_i \approx \operatorname{Pr}(y_i = 1)`
for the positive class, the Brier score is then equal to:

.. math::

    BS(y, \hat{p}) = \frac{1}{N} \sum_{i=0}^{N-1} (y_i - \hat{p}_i)^2

The :func:`brier_score_loss` function computes the Brier score given the
ground-truth labels and predicted probabilities, as returned by an estimator's
``predict_proba`` method. The `scale_by_half` parameter controls which of the
two above definitions to follow. Here is a small example of usage of this
function::

    >>> import numpy as np
    >>> from sklearn.metrics import brier_score_loss
    >>> y_true = np.array([0, 1, 1, 0])
    >>> y_true_categorical = np.array(["spam", "ham", "ham", "spam"])
    >>> y_prob = np.array([0.1, 0.9, 0.8, 0.4])
    >>> brier_score_loss(y_true, y_prob)
    0.055
    >>> brier_score_loss(y_true, 1 - y_prob, pos_label=0)
    0.055
    >>> brier_score_loss(y_true_categorical, y_prob, pos_label="ham")
    0.055
    >>> brier_score_loss(
    ...     ["eggs", "ham", "spam"],
    ...     [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.2, 0.2, 0.6]],
    ...     labels=["eggs", "ham", "spam"],
    ... )
    0.146...

The Brier score can be used to assess how well a classifier is calibrated.
However, a lower Brier score loss does not always mean a better calibration.
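As a sanity check, the multiclass definition above can be evaluated directly
with NumPy; the one-hot matrix and probability matrix below correspond to the
``["eggs", "ham", "spam"]`` example in this section:

```python
import numpy as np

# One-hot encoding of ["eggs", "ham", "spam"] over labels ["eggs", "ham", "spam"]
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
# Predicted class probabilities for the same three samples
P_hat = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.2, 0.2, 0.6]])

# BS(Y, P_hat): sum of squared differences over classes, averaged over samples
bs = np.sum((Y - P_hat) ** 2) / Y.shape[0]
print(round(bs, 4))  # 0.1467
```

This reproduces the ``0.146...`` value: the per-sample squared errors are
0.06, 0.14 and 0.24, whose mean is 0.44/3.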