@@ -30,16 +30,20 @@ approximately 80% actually belong to the positive class.

Calibration curves
------------------

- Calibration curves (also known as reliability diagrams) compare how well the
- probabilistic predictions of a binary classifier are calibrated. It plots
- the true frequency of the positive label against its predicted probability,
- for binned predictions.
- The x axis represents the average predicted probability in each bin. The
- y axis is the *fraction of positives*, i.e. the proportion of samples whose
- class is the positive class (in each bin). The top calibration curve plot
- is created with :func:`CalibrationDisplay.from_estimators`, which uses
- :func:`calibration_curve` to calculate the per bin average predicted
- probabilities and fraction of positives.
+ Calibration curves, also referred to as *reliability diagrams* (Wilks 1995 [2]_),
+ compare how well the probabilistic predictions of a binary classifier are calibrated.
+ They plot the frequency of the positive label (more precisely, an estimate of the
+ *conditional event probability* :math:`P(Y=1|\text{predict\_proba})`) on the y-axis
+ against the predicted probability :term:`predict_proba` of a model on the x-axis.
+ The tricky part is obtaining values for the y-axis.
+ In scikit-learn, this is accomplished by binning the predictions such that the x-axis
+ represents the average predicted probability in each bin.
+ The y-axis is then the *fraction of positives* given the predictions of that bin, i.e.
+ the proportion of samples whose class is the positive class (in each bin).
+
+ The top calibration curve plot is created with
+ :func:`CalibrationDisplay.from_estimator`, which uses :func:`calibration_curve` to
+ calculate the per bin average predicted probabilities and fraction of positives.
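+
+ As a minimal, self-contained sketch of these two entry points (the dataset and
+ classifier below are illustrative assumptions, not taken from the plot above)::
+
+     from sklearn.calibration import CalibrationDisplay, calibration_curve
+     from sklearn.datasets import make_classification
+     from sklearn.linear_model import LogisticRegression
+     from sklearn.model_selection import train_test_split
+
+     X, y = make_classification(n_samples=1_000, random_state=0)
+     X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
+     clf = LogisticRegression().fit(X_train, y_train)
+
+     # Per-bin fraction of positives (y-axis) and mean predicted probability
+     # (x-axis) of the positive class.
+     prob_true, prob_pred = calibration_curve(
+         y_test, clf.predict_proba(X_test)[:, 1], n_bins=10
+     )
+     # Or build the reliability diagram directly from the fitted estimator.
+     disp = CalibrationDisplay.from_estimator(clf, X_test, y_test, n_bins=10)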

:func:`CalibrationDisplay.from_estimator`
takes as input a fitted classifier, which is used to calculate the predicted
probabilities. The classifier thus must have a :term:`predict_proba` method. For
@@ -56,21 +60,24 @@ by showing the number of samples in each predicted probability bin.

.. currentmodule:: sklearn.linear_model

- :class:`LogisticRegression` returns well calibrated predictions by default as it directly
- optimizes :ref:`log_loss`. In contrast, the other methods return biased probabilities;
- with different biases per method:
+ :class:`LogisticRegression` returns well calibrated predictions by default as it has a
+ canonical link function for its loss, i.e. the logit link for the :ref:`log_loss`.
+ This leads to the so-called **balance property**, see [7]_ and
+ :ref:`Logistic_regression`.
+ In contrast, the other models shown return biased probabilities, with a
+ different bias per model.
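+
+ A short sketch of the balance property (illustrative data; the property holds
+ exactly for an unpenalized fit with an intercept, hence ``penalty=None``, which
+ assumes scikit-learn >= 1.2)::
+
+     from sklearn.datasets import make_classification
+     from sklearn.linear_model import LogisticRegression
+
+     X, y = make_classification(n_samples=1_000, random_state=0)
+     clf = LogisticRegression(penalty=None, max_iter=1_000).fit(X, y)
+     # Balance property: on the training data, the mean predicted probability
+     # of the positive class matches the observed frequency of positives.
+     print(clf.predict_proba(X)[:, 1].mean(), y.mean())  # approximately equal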

.. currentmodule:: sklearn.naive_bayes

- :class:`GaussianNB` tends to push probabilities to 0 or 1 (note the counts
+ :class:`GaussianNB` (Naive Bayes) tends to push probabilities to 0 or 1 (note the counts
in the histograms). This is mainly because it makes the assumption that
features are conditionally independent given the class, which is not the
case in this dataset, which contains 2 redundant features.
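+
+ A rough sketch of this effect (hypothetical data whose redundant features violate
+ the independence assumption)::
+
+     import numpy as np
+     from sklearn.datasets import make_classification
+     from sklearn.naive_bayes import GaussianNB
+
+     X, y = make_classification(
+         n_samples=1_000, n_informative=2, n_redundant=2, random_state=0
+     )
+     proba = GaussianNB().fit(X, y).predict_proba(X)[:, 1]
+     # A large share of the predictions piles up near 0 or 1.
+     print(np.mean((proba < 0.1) | (proba > 0.9)))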

.. currentmodule:: sklearn.ensemble

:class:`RandomForestClassifier` shows the opposite behavior: the histograms
- show peaks at approximately 0.2 and 0.9 probability, while probabilities
+ show peaks at probabilities of approximately 0.2 and 0.9, while probabilities
close to 0 or 1 are very rare. An explanation for this is given by
Niculescu-Mizil and Caruana [1]_: "Methods such as bagging and random
forests that average predictions from a base set of models can have
@@ -85,18 +92,16 @@ predict values larger than 0 for this case, thus moving the average
prediction of the bagged ensemble away from 0. We observe this effect most
strongly with random forests because the base-level trees trained with
random forests have relatively high variance due to feature subsetting." As
- a result, the calibration curve also referred to as the reliability diagram
- (Wilks 1995 [2]_) shows a characteristic sigmoid shape, indicating that the
- classifier could trust its "intuition" more and return probabilities closer
+ a result, the calibration curve shows a characteristic sigmoid shape, indicating that
+ the classifier could trust its "intuition" more and return probabilities closer
to 0 or 1 typically.

.. currentmodule:: sklearn.svm

- Linear Support Vector Classification (:class:`LinearSVC`) shows an even more
- sigmoid curve than :class:`~sklearn.ensemble.RandomForestClassifier`, which is
- typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [1]_),
- which focus on difficult to classify samples that are close to the decision
- boundary (the support vectors).
+ :class:`LinearSVC` (Support Vector Classification) shows an even more sigmoid curve
+ than the random forest, which is typical for maximum-margin methods (compare
+ Niculescu-Mizil and Caruana [1]_), which focus on difficult to classify samples that
+ are close to the decision boundary (the support vectors).

Calibrating a classifier
------------------------
@@ -107,10 +112,11 @@ Calibrating a classifier consists of fitting a regressor (called a
*calibrator*) that maps the output of the classifier (as given by
:term:`decision_function` or :term:`predict_proba`) to a calibrated probability
in [0, 1]. Denoting the output of the classifier for a given sample by :math:`f_i`,
- the calibrator tries to predict :math:`p(y_i = 1 | f_i)`.
+ the calibrator tries to predict the conditional event probability
+ :math:`P(y_i = 1 | f_i)`.
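+
+ A minimal sketch of this with :class:`~sklearn.calibration.CalibratedClassifierCV`
+ (data and base estimator are illustrative; the ``cv`` argument relates to the
+ data-splitting concern discussed next)::
+
+     from sklearn.calibration import CalibratedClassifierCV
+     from sklearn.datasets import make_classification
+     from sklearn.model_selection import train_test_split
+     from sklearn.svm import LinearSVC
+
+     X, y = make_classification(n_samples=1_000, random_state=0)
+     X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
+
+     # LinearSVC only provides decision_function; the fitted calibrator maps
+     # its scores to probabilities in [0, 1].
+     calibrated_clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
+     calibrated_clf.fit(X_train, y_train)
+     proba = calibrated_clf.predict_proba(X_test)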

- The samples that are used to fit the calibrator should not be the same
- samples used to fit the classifier, as this would introduce bias.
+ Ideally, the calibrator is fit on a dataset independent of the training data used to
+ fit the classifier in the first place.
This is because the performance of the classifier on its training data would be
better than for novel data. Using the classifier output of training data
to fit the calibrator would thus result in a biased calibrator that maps to
@@ -200,22 +206,21 @@ the classifier output for each binary class is normally distributed with
the same variance [6]_. This can be a problem for highly imbalanced
classification problems, where outputs do not have equal variance.

- In general this method is most effective when the un-calibrated model is
- under-confident and has similar calibration errors for both high and low
- outputs.
+ In general, this method is most effective for small sample sizes or when the
+ un-calibrated model is under-confident and has similar calibration errors for both
+ high and low outputs.
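+
+ As a sketch of the functional form behind the 'sigmoid' method (Platt scaling;
+ the parameters ``a`` and ``b`` are placeholders here, whereas scikit-learn fits
+ them on the calibration data)::
+
+     import numpy as np
+
+     def platt_sigmoid(f, a, b):
+         # Maps a raw classifier output f to a probability via the logistic
+         # function 1 / (1 + exp(a * f + b)).
+         return 1.0 / (1.0 + np.exp(a * f + b))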

Isotonic
^^^^^^^^

The 'isotonic' method fits a non-parametric isotonic regressor, which outputs
- a step-wise non-decreasing function (see :mod:`sklearn.isotonic`). It
- minimizes:
+ a step-wise non-decreasing function, see :mod:`sklearn.isotonic`. It minimizes:

.. math::
   \sum_{i=1}^{n} (y_i - \hat{f}_i)^2

- subject to :math:`\hat{f}_i >= \hat{f}_j` whenever
- :math:`f_i >= f_j`. :math:`y_i` is the true
+ subject to :math:`\hat{f}_i \geq \hat{f}_j` whenever
+ :math:`f_i \geq f_j`. :math:`y_i` is the true
label of sample :math:`i` and :math:`\hat{f}_i` is the output of the
calibrated classifier for sample :math:`i` (i.e., the calibrated probability).
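+
+ A small sketch with :class:`~sklearn.isotonic.IsotonicRegression` directly (toy
+ scores and labels, made up for illustration)::
+
+     import numpy as np
+     from sklearn.isotonic import IsotonicRegression
+
+     f = np.array([0.1, 0.3, 0.35, 0.6, 0.8])  # classifier outputs f_i
+     y = np.array([0, 0, 1, 1, 1])             # true labels y_i
+     iso = IsotonicRegression(out_of_bounds="clip")
+     # Least-squares fit subject to the non-decreasing constraint.
+     f_hat = iso.fit_transform(f, y)
+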
This method is more general when compared to 'sigmoid' as the only restriction
@@ -277,3 +282,8 @@ one, a postprocessing is performed to normalize them.
binary classifiers with beta calibration
<https://projecteuclid.org/euclid.ejs/1513306867>`_
Kull, M., Silva Filho, T. M., & Flach, P. (2017).
+
+ .. [7] Mario V. Wüthrich, Michael Merz (2023).
+    :doi:`"Statistical Foundations of Actuarial Learning and its Applications"
+    <10.1007/978-3-031-12409-9>`
+    Springer Actuarial