Metrics for Calibration: Theoretical Description
This document describes the metrics used to evaluate how well a predictive model estimates uncertainty and how accurately its predicted probabilities are calibrated.
Expected Calibration Error (ECE)

Measures the gap between predicted confidence levels and observed accuracy:

\[
\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|
\]

where:

- \(M\) = number of confidence bins and \(n\) = total number of samples
- \(B_m\) = indices of the samples whose confidence falls into bin \(m\)
- \(\text{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} y_i\) — accuracy within bin \(m\)
- \(\text{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{f}(x_i)\) — average confidence in bin \(m\)
Tip: the lower the ECE, the better the calibration.
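As a concrete illustration, here is a minimal NumPy sketch of the binary case with equal-width bins; the function name, signature, and binning scheme are assumptions made for this example rather than a reference implementation.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Illustrative binary ECE: bin samples by predicted confidence and
    average the |accuracy - confidence| gap per bin, weighted by bin size."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    n = len(y_true)
    # Equal-width bins over [0, 1]; each sample falls into exactly one bin.
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, bin_edges[1:-1])
    ece = 0.0
    for m in range(n_bins):
        in_bin = bin_ids == m
        if not in_bin.any():
            continue
        acc = y_true[in_bin].mean()    # acc(B_m): observed frequency in the bin
        conf = y_prob[in_bin].mean()   # conf(B_m): mean predicted confidence
        ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```

For example, `expected_calibration_error([0, 1, 1, 0], [0.1, 0.8, 0.7, 0.4])` returns a value in \([0, 1]\), with 0 meaning perfect calibration on the evaluated sample.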
Top-Label ECE

Extends ECE to multi-class settings by measuring the calibration of the most confident prediction (the top label) for each class separately and averaging the results:

\[
\text{Top-Label ECE} = \frac{1}{L} \sum_{j=1}^{L} \sum_{i=1}^{M} \frac{|B_{i,j}|}{n_j} \left| \text{acc}(B_{i,j}) - \text{conf}(B_{i,j}) \right|
\]

where:

- \(L\) = number of unique labels
- \(M\) = number of confidence bins
- \(B_{i,j}\) = indices in bin \(i\) for label \(j\)
- \(n_j\) = total samples for label \(j\)
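A sketch of how this can be computed from a matrix of class probabilities; the function name, the equal-width binning, and the assumption that `y_true` holds integer class indices matching the columns of `y_proba` are all illustrative choices.

```python
import numpy as np

def top_label_ece(y_true, y_proba, n_bins=10):
    """Illustrative Top-Label ECE: compute a one-vs-rest ECE over the samples
    whose most confident prediction is label j, then average over labels."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)          # shape (n_samples, n_classes)
    top_label = y_proba.argmax(axis=1)     # most confident class per sample
    top_conf = y_proba.max(axis=1)         # confidence of that class
    labels = np.unique(top_label)          # the L unique top labels
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for j in labels:
        is_j = top_label == j
        n_j = is_j.sum()                   # n_j: samples whose top label is j
        correct = (y_true[is_j] == j).astype(float)
        conf = top_conf[is_j]
        bin_ids = np.digitize(conf, bin_edges[1:-1])
        ece_j = 0.0
        for i in range(n_bins):
            in_bin = bin_ids == i          # B_{i,j}
            if not in_bin.any():
                continue
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece_j += (in_bin.sum() / n_j) * gap
        total += ece_j
    return total / len(labels)             # average over the L labels
```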
Cumulative Differences

Calculates the cumulative differences between true values and prediction scores, with samples sorted by increasing score [1]:

\[
C_k = \frac{1}{n} \sum_{i=1}^{k} \left( y_{\sigma(i)} - \hat{f}(x_{\sigma(i)}) \right), \qquad k = 1, \dots, n,
\]

where \(\sigma\) orders the samples so that \(\hat{f}(x_{\sigma(1)}) \le \dots \le \hat{f}(x_{\sigma(n)})\). For a well-calibrated model the curve stays close to zero over the whole range of scores.
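A minimal NumPy sketch of this curve, under the sorting convention described above; the function name is an assumption for illustration.

```python
import numpy as np

def cumulative_differences(y_true, y_score):
    """Illustrative cumulative difference curve: sort samples by score,
    then accumulate the (true value - score) gaps, scaled by 1/n."""
    order = np.argsort(y_score)                       # sort by prediction score
    y_true = np.asarray(y_true, dtype=float)[order]
    y_score = np.asarray(y_score, dtype=float)[order]
    return np.cumsum(y_true - y_score) / len(y_true)  # C_1, ..., C_n
```

Plotting this curve against the sorted scores makes miscalibrated score regions visible as sustained drifts away from zero.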
Kolmogorov-Smirnov Statistics

Summarizes the cumulative difference curve by its largest absolute value, testing whether the calibration curve deviates significantly from the ideal diagonal line.
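A self-contained sketch, assuming the statistic is taken as the maximum absolute excursion of the cumulative difference curve defined above.

```python
import numpy as np

def kolmogorov_smirnov_statistic(y_true, y_score):
    """Illustrative KS statistic: largest absolute value reached by the
    cumulative difference curve (samples sorted by score)."""
    order = np.argsort(y_score)
    diffs = np.cumsum(
        np.asarray(y_true, dtype=float)[order]
        - np.asarray(y_score, dtype=float)[order]
    ) / len(y_score)
    return float(np.max(np.abs(diffs)))
```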
Kuiper Statistics

Similar to the Kolmogorov-Smirnov statistic, but combines the largest positive and largest negative deviations from perfect calibration, making it sensitive to miscalibration in both directions.
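A corresponding sketch, under the assumption that the statistic is computed as the range (maximum minus minimum) of the cumulative difference curve.

```python
import numpy as np

def kuiper_statistic(y_true, y_score):
    """Illustrative Kuiper statistic: range of the cumulative difference
    curve, i.e. its largest value minus its smallest value."""
    order = np.argsort(y_score)
    diffs = np.cumsum(
        np.asarray(y_true, dtype=float)[order]
        - np.asarray(y_score, dtype=float)[order]
    ) / len(y_score)
    return float(np.max(diffs) - np.min(diffs))
```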
Spiegelhalter Statistics

Tests the overall calibration of predicted probabilities with a z-test derived from the Brier score: under the null hypothesis of perfect calibration the statistic is approximately standard normal.
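A sketch of the classical form of Spiegelhalter's z statistic for binary outcomes; `scipy` is used only to turn the statistic into a two-sided p-value.

```python
import numpy as np
from scipy.stats import norm

def spiegelhalter_statistic(y_true, y_prob):
    """Illustrative Spiegelhalter z-test: under perfect calibration the
    numerator has zero mean and z is approximately standard normal."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    numerator = np.sum((y_true - y_prob) * (1.0 - 2.0 * y_prob))
    denominator = np.sqrt(
        np.sum((1.0 - 2.0 * y_prob) ** 2 * y_prob * (1.0 - y_prob))
    )
    z = numerator / denominator
    p_value = 2.0 * norm.sf(abs(z))   # two-sided p-value
    return z, p_value
```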
References

[1] Arrieta-Ibarra I., et al. "Metrics of calibration for probabilistic predictions." JMLR 23(1), 2022.