DOC improve calibration user guide (#25687) · scikit-learn/scikit-learn@e3e880f · GitHub

Commit e3e880f

DOC improve calibration user guide (#25687)
1 parent 1ae26d7 commit e3e880f

File tree

1 file changed: +43 −33 lines changed

doc/modules/calibration.rst

Lines changed: 43 additions & 33 deletions
@@ -30,16 +30,20 @@ approximately 80% actually belong to the positive class.
 Calibration curves
 ------------------
 
-Calibration curves (also known as reliability diagrams) compare how well the
-probabilistic predictions of a binary classifier are calibrated. It plots
-the true frequency of the positive label against its predicted probability,
-for binned predictions.
-The x axis represents the average predicted probability in each bin. The
-y axis is the *fraction of positives*, i.e. the proportion of samples whose
-class is the positive class (in each bin). The top calibration curve plot
-is created with :func:`CalibrationDisplay.from_estimators`, which uses
-:func:`calibration_curve` to calculate the per bin average predicted
-probabilities and fraction of positives.
+Calibration curves, also referred to as *reliability diagrams* (Wilks 1995 [2]_),
+compare how well the probabilistic predictions of a binary classifier are calibrated.
+It plots the frequency of the positive label (to be more precise, an estimation of the
+*conditional event probability* :math:`P(Y=1|\text{predict\_proba})`) on the y-axis
+against the predicted probability :term:`predict_proba` of a model on the x-axis.
+The tricky part is to get values for the y-axis.
+In scikit-learn, this is accomplished by binning the predictions such that the x-axis
+represents the average predicted probability in each bin.
+The y-axis is then the *fraction of positives* given the predictions of that bin, i.e.
+the proportion of samples whose class is the positive class (in each bin).
+
+The top calibration curve plot is created with
+:func:`CalibrationDisplay.from_estimator`, which uses :func:`calibration_curve` to
+calculate the per bin average predicted probabilities and fraction of positives.
 :func:`CalibrationDisplay.from_estimator`
 takes as input a fitted classifier, which is used to calculate the predicted
 probabilities. The classifier thus must have :term:`predict_proba` method. For
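
A rough sketch of the binning described in this hunk, using :func:`calibration_curve`
directly; the synthetic dataset and the logistic regression model below are
illustrative assumptions, not the setup behind the guide's figures::

    from sklearn.calibration import calibration_curve
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression().fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]  # predicted P(Y=1) per sample

    # prob_true: fraction of positives per bin (y-axis)
    # prob_pred: average predicted probability per bin (x-axis)
    prob_true, prob_pred = calibration_curve(y_test, proba, n_bins=10)
    for p_bin, frac_pos in zip(prob_pred, prob_true):
        print(f"mean predicted {p_bin:.2f} -> fraction of positives {frac_pos:.2f}")

``CalibrationDisplay.from_estimator(clf, X_test, y_test, n_bins=10)`` would plot the
same pairs directly.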
@@ -56,21 +60,24 @@ by showing the number of samples in each predicted probability bin.
 
 .. currentmodule:: sklearn.linear_model
 
-:class:`LogisticRegression` returns well calibrated predictions by default as it directly
-optimizes :ref:`log_loss`. In contrast, the other methods return biased probabilities;
-with different biases per method:
+:class:`LogisticRegression` returns well calibrated predictions by default as it has a
+canonical link function for its loss, i.e. the logit-link for the :ref:`log_loss`.
+This leads to the so-called **balance property**, see [7]_ and
+:ref:`Logistic_regression`.
+In contrast to that, the other shown models return biased probabilities; with
+different biases per model.
 
 .. currentmodule:: sklearn.naive_bayes
 
-:class:`GaussianNB` tends to push probabilities to 0 or 1 (note the counts
+:class:`GaussianNB` (Naive Bayes) tends to push probabilities to 0 or 1 (note the counts
 in the histograms). This is mainly because it makes the assumption that
 features are conditionally independent given the class, which is not the
 case in this dataset which contains 2 redundant features.
 
 .. currentmodule:: sklearn.ensemble
 
 :class:`RandomForestClassifier` shows the opposite behavior: the histograms
-show peaks at approximately 0.2 and 0.9 probability, while probabilities
+show peaks at probabilities approximately 0.2 and 0.9, while probabilities
 close to 0 or 1 are very rare. An explanation for this is given by
 Niculescu-Mizil and Caruana [1]_: "Methods such as bagging and random
 forests that average predictions from a base set of models can have
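
The **balance property** added in this hunk can be checked empirically: for an
unpenalized logistic regression with an intercept, the log-loss score equations force
the average of ``predict_proba`` over the training data to equal the observed positive
rate. A minimal sketch on made-up data::

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, random_state=42)
    # penalty=None requires scikit-learn >= 1.2; it gives the unpenalized MLE
    clf = LogisticRegression(penalty=None).fit(X, y)

    mean_pred = clf.predict_proba(X)[:, 1].mean()  # average predicted probability
    print(mean_pred, y.mean())  # the two coincide, up to solver tolerance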
@@ -85,18 +92,16 @@ predict values larger than 0 for this case, thus moving the average
 prediction of the bagged ensemble away from 0. We observe this effect most
 strongly with random forests because the base-level trees trained with
 random forests have relatively high variance due to feature subsetting." As
-a result, the calibration curve also referred to as the reliability diagram
-(Wilks 1995 [2]_) shows a characteristic sigmoid shape, indicating that the
-classifier could trust its "intuition" more and return probabilities closer
+a result, the calibration curve shows a characteristic sigmoid shape, indicating that
+the classifier could trust its "intuition" more and return probabilities closer
 to 0 or 1 typically.
 
 .. currentmodule:: sklearn.svm
 
-Linear Support Vector Classification (:class:`LinearSVC`) shows an even more
-sigmoid curve than :class:`~sklearn.ensemble.RandomForestClassifier`, which is
-typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [1]_),
-which focus on difficult to classify samples that are close to the decision
-boundary (the support vectors).
+:class:`LinearSVC` (SVC) shows an even more sigmoid curve than the random forest, which
+is typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [1]_), which
+focus on difficult to classify samples that are close to the decision boundary (the
+support vectors).
 
 Calibrating a classifier
 ------------------------
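
The sigmoid shape discussed in this hunk can be reproduced with a short sketch using
``CalibrationDisplay.from_estimator``; the synthetic data and settings below are
arbitrary assumptions, not the benchmark behind the guide's figures::

    import matplotlib.pyplot as plt
    from sklearn.calibration import CalibrationDisplay
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    # for a sigmoid-shaped curve, low-probability bins tend to fall above the
    # diagonal and high-probability bins below it: predictions are pushed
    # away from 0 and 1
    CalibrationDisplay.from_estimator(rf, X_test, y_test, n_bins=10)
    plt.show()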
@@ -107,10 +112,11 @@ Calibrating a classifier consists of fitting a regressor (called a
 *calibrator*) that maps the output of the classifier (as given by
 :term:`decision_function` or :term:`predict_proba`) to a calibrated probability
 in [0, 1]. Denoting the output of the classifier for a given sample by :math:`f_i`,
-the calibrator tries to predict :math:`p(y_i = 1 | f_i)`.
+the calibrator tries to predict the conditional event probability
+:math:`P(y_i = 1 | f_i)`.
 
-The samples that are used to fit the calibrator should not be the same
-samples used to fit the classifier, as this would introduce bias.
+Ideally, the calibrator is fit on a dataset independent of the training data used to
+fit the classifier in the first place.
 This is because performance of the classifier on its training data would be
 better than for novel data. Using the classifier output of training data
 to fit the calibrator would thus result in a biased calibrator that maps to
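
In scikit-learn, this train/calibration separation is what
:class:`~sklearn.calibration.CalibratedClassifierCV` automates: with ``cv=k``, for each
of the ``k`` splits the classifier is fit on the train part and the calibrator on the
held-out part. A minimal sketch on made-up data (the model choice is illustrative)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=2000, random_state=0)

    # each fold's calibrator is fit on data the classifier never saw
    calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=3)
    calibrated.fit(X, y)
    proba = calibrated.predict_proba(X)[:, 1]  # calibrated probabilities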
@@ -200,22 +206,21 @@ the classifier output for each binary class is normally distributed with
 the same variance [6]_. This can be a problem for highly imbalanced
 classification problems, where outputs do not have equal variance.
 
-In general this method is most effective when the un-calibrated model is
-under-confident and has similar calibration errors for both high and low
-outputs.
+In general this method is most effective for small sample sizes or when the
+un-calibrated model is under-confident and has similar calibration errors for both
+high and low outputs.
 
 Isotonic
 ^^^^^^^^
 
 The 'isotonic' method fits a non-parametric isotonic regressor, which outputs
-a step-wise non-decreasing function (see :mod:`sklearn.isotonic`). It
-minimizes:
+a step-wise non-decreasing function, see :mod:`sklearn.isotonic`. It minimizes:
 
 .. math::
     \sum_{i=1}^{n} (y_i - \hat{f}_i)^2
 
-subject to :math:`\hat{f}_i >= \hat{f}_j` whenever
-:math:`f_i >= f_j`. :math:`y_i` is the true
+subject to :math:`\hat{f}_i \geq \hat{f}_j` whenever
+:math:`f_i \geq f_j`. :math:`y_i` is the true
 label of sample :math:`i` and :math:`\hat{f}_i` is the output of the
 calibrated classifier for sample :math:`i` (i.e., the calibrated probability).
 This method is more general when compared to 'sigmoid' as the only restriction
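
The least-squares-under-monotonicity problem in this hunk can be solved directly with
:class:`~sklearn.isotonic.IsotonicRegression`; a sketch with made-up classifier
outputs ``f`` and labels ``y``::

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    f = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])  # classifier outputs f_i
    y = np.array([0, 0, 1, 0, 1, 1, 1, 1])                  # true labels y_i

    # minimizes sum_i (y_i - f_hat_i)^2 subject to f_hat non-decreasing in f
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    f_hat = iso.fit_transform(f, y)  # step-wise non-decreasing calibrated values
    print(f_hat)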
@@ -277,3 +282,8 @@ one, a postprocessing is performed to normalize them.
     binary classifiers with beta calibration
     <https://projecteuclid.org/euclid.ejs/1513306867>`_
     Kull, M., Silva Filho, T. M., & Flach, P. (2017).
+
+.. [7] Mario V. Wüthrich, Michael Merz (2023).
+       :doi:`"Statistical Foundations of Actuarial Learning and its Applications"
+       <10.1007/978-3-031-12409-9>`
+       Springer Actuarial
