🚀 Feature
A new metric computing the Expected Calibration Error (ECE) from Naeini et al. 2015. Useful for determining whether a classifier's softmax probability scores are well calibrated and represent reasonable probabilities.
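For reference, a minimal sketch of the binned ECE computation (the function name, bin count, and tensor shapes here are illustrative choices of mine, not an existing API):

```python
import torch

def expected_calibration_error(probs: torch.Tensor, targets: torch.Tensor, n_bins: int = 15) -> torch.Tensor:
    """ECE over equal-width confidence bins; probs is (N, C) softmax output, targets is (N,) class indices."""
    confidences, predictions = probs.max(dim=1)            # top-1 probability and predicted class
    accuracies = predictions.eq(targets).float()           # 1 if the prediction is correct, else 0
    bin_edges = torch.linspace(0, 1, n_bins + 1)
    ece = torch.zeros(1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)  # samples whose confidence falls in this bin
        prop_in_bin = in_bin.float().mean()                # |B_m| / n
        if prop_in_bin > 0:
            acc_in_bin = accuracies[in_bin].mean()         # acc(B_m)
            conf_in_bin = confidences[in_bin].mean()       # conf(B_m)
            ece += (acc_in_bin - conf_in_bin).abs() * prop_in_bin
    return ece
```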
Motivation
I ran into this metric after seeing Guo et al. 2017, which discusses how very large and deep networks are vulnerable to calibration issues (i.e., they tend to be systematically over- or under-confident) and suggests temperature scaling as a method for producing reasonably calibrated softmax outputs using a validation dataset. I've implemented a simple version of this metric using the old PyTorch Lightning Metric API, mostly based on @gpleiss's PyTorch nn.Module implementation of the metric here.
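As a rough illustration of the temperature scaling idea (not the metric proposed here; the variable names and LBFGS settings below are assumptions on my part, loosely following Guo et al. 2017):

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_targets: torch.Tensor) -> torch.Tensor:
    """Fit a single scalar temperature T on held-out validation logits by minimizing NLL."""
    temperature = torch.nn.Parameter(torch.ones(1) * 1.5)  # illustrative initial guess for T
    optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=50)

    def closure():
        optimizer.zero_grad()
        # Dividing logits by a single scalar T leaves the argmax (and thus accuracy)
        # unchanged while rescaling the softmax confidences.
        loss = F.cross_entropy(val_logits / temperature, val_targets)
        loss.backward()
        return loss

    optimizer.step(closure)
    return temperature.detach()
```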
Additional context
There was an interesting preprint discussing alternatives to ECE that might also be worth integrating - the main criticism being that ECE can be reductive in multi-class settings because it only looks at the probability of the predicted class (i.e. the highest predicted probability) rather than all of the probabilities produced by the model. I'd have to think more about this, but it could also be worth adding.