Metrics — Theoretical Description
Terminology
In theoretical parts of the documentation:
- `alpha` is equivalent to `1 - confidence_level`; it can be seen as a risk level.
- "calibrate" and "calibration" are equivalent to "conformalize" and "conformalization".
This document provides detailed descriptions of various metrics used to evaluate the performance of predictive models, particularly focusing on their ability to estimate uncertainties and calibrate predictions accurately.
1. General Metrics

Regression Coverage Score (RCS)

Calculates the fraction of true outcomes that fall within the provided prediction intervals:

\[\text{RCS} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[\hat{y}^{\text{low}}_i \leq y_i \leq \hat{y}^{\text{up}}_i\right]\]
Regression Mean Width Score (RMWS)

Assesses the average width of the prediction intervals:

\[\text{RMWS} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}^{\text{up}}_i - \hat{y}^{\text{low}}_i\right)\]
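Both scores reduce to a few NumPy operations. A minimal sketch (variable names are illustrative, not a library API):

```python
import numpy as np

# Toy data: true values and prediction interval bounds.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_low = np.array([0.5, 2.5, 2.0, 3.0])
y_up = np.array([1.5, 3.5, 4.0, 5.0])

# RCS: fraction of true values inside their interval.
rcs = np.mean((y_true >= y_low) & (y_true <= y_up))

# RMWS: average interval width.
rmws = np.mean(y_up - y_low)

print(rcs)   # 0.75 (2.0 falls outside [2.5, 3.5])
print(rmws)  # 1.5
```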
Classification Coverage Score (CCS)

Measures how often the true class labels fall within the predicted sets:

\[\text{CCS} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[y_i \in \hat{C}(x_i)\right]\]
Classification Mean Width Score (CMWS)

Average size of the prediction sets across all samples:

\[\text{CMWS} = \frac{1}{n} \sum_{i=1}^{n} \left|\hat{C}(x_i)\right|\]
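A minimal sketch of both classification scores, representing prediction sets as a boolean membership matrix (an illustrative encoding, not a required format):

```python
import numpy as np

# Prediction sets as a boolean matrix: pred_sets[i, k] is True if
# class k is in the prediction set of sample i.
pred_sets = np.array([
    [True,  False, False],
    [True,  True,  False],
    [False, True,  True],
])
y_true = np.array([0, 2, 2])  # true class labels

# CCS: fraction of samples whose true label is in their set.
ccs = np.mean(pred_sets[np.arange(len(y_true)), y_true])

# CMWS: average prediction-set size.
cmws = np.mean(pred_sets.sum(axis=1))

print(ccs)   # 2/3: sample 1's true label 2 is not in its set
print(cmws)  # 5/3: set sizes are 1, 2, 2
```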
Size-Stratified Coverage (SSC)

Evaluates how the size of prediction sets or intervals affects their ability to cover true outcomes [1]. Coverage is computed separately within groups of samples stratified by set size or interval width:

Regression: samples are grouped by interval width, and the empirical coverage is reported per width group.

Classification: samples are grouped by prediction-set size, and the empirical coverage is reported per size group.
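A minimal sketch of the classification case (reporting a per-size dictionary is an illustrative choice; some definitions report only the worst stratum):

```python
import numpy as np

# Group samples by prediction-set size and compute coverage per group.
sizes = np.array([1, 1, 2, 2, 2])    # |C(x_i)| for each sample
covered = np.array([1, 0, 1, 1, 1])  # 1 if y_i is in C(x_i)

ssc = {int(s): float(covered[sizes == s].mean()) for s in np.unique(sizes)}
print(ssc)  # {1: 0.5, 2: 1.0} — small sets under-cover here
```

A stratum with coverage well below the target flags a lack of adaptivity: the set sizes do not track the difficulty of the samples.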
Hilbert-Schmidt Independence Criterion (HSIC)

A non-parametric measure of independence between interval sizes and coverage indicators [3]:

\[\text{HSIC} = \frac{1}{(n-1)^2} \operatorname{tr}\left(\mathbf{K} \mathbf{H} \mathbf{L} \mathbf{H}\right)\]

where:
- \(\mathbf{K}\), \(\mathbf{L}\) are kernel matrices for interval sizes and coverage indicators
- \(\mathbf{H} = \mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}^\top\) is the centering matrix
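A sketch of the biased HSIC estimator with RBF kernels (the kernel choice and the bandwidth `gamma` are illustrative assumptions, not prescribed by the metric):

```python
import numpy as np

def rbf_kernel(x, gamma=1.0):
    """RBF kernel matrix for a 1-D array."""
    d = x[:, None] - x[None, :]
    return np.exp(-gamma * d ** 2)

def hsic(widths, covered, gamma=1.0):
    """Biased HSIC estimate: tr(K H L H) / (n - 1)^2."""
    n = len(widths)
    K = rbf_kernel(np.asarray(widths, dtype=float), gamma)
    L = rbf_kernel(np.asarray(covered, dtype=float), gamma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Interval widths and coverage indicators; values near 0 suggest
# interval size is independent of whether the interval covers.
widths = [1.0, 1.2, 0.8, 2.0, 1.9]
covered = [1, 1, 1, 0, 0]
print(hsic(widths, covered))
```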
Coverage Width-Based Criterion (CWC)

Balances empirical coverage and interval width, rewarding narrow intervals and penalizing poor coverage [4].
Mean Winkler Interval Score (MWI)

Combines interval width with a penalty for non-coverage [5]. For a prediction interval \([\hat{y}^{\text{low}}_i, \hat{y}^{\text{up}}_i]\) at risk level \(\alpha\), the Winkler score of sample \(i\) is

\[W_i = \left(\hat{y}^{\text{up}}_i - \hat{y}^{\text{low}}_i\right) + \frac{2}{\alpha}\left(\hat{y}^{\text{low}}_i - y_i\right)\mathbb{1}\left[y_i < \hat{y}^{\text{low}}_i\right] + \frac{2}{\alpha}\left(y_i - \hat{y}^{\text{up}}_i\right)\mathbb{1}\left[y_i > \hat{y}^{\text{up}}_i\right]\]

and the MWI is the average \(\frac{1}{n}\sum_{i=1}^{n} W_i\).
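A compact NumPy sketch of the mean Winkler score (function and variable names are illustrative):

```python
import numpy as np

def mean_winkler_score(y_true, y_low, y_up, alpha):
    """Mean Winkler interval score at miscoverage level alpha:
    interval width, plus 2/alpha times the distance to the nearest
    bound whenever the true value falls outside the interval."""
    width = y_up - y_low
    below = np.clip(y_low - y_true, 0.0, None)  # y below the interval
    above = np.clip(y_true - y_up, 0.0, None)   # y above the interval
    return float(np.mean(width + (2.0 / alpha) * (below + above)))

y_true = np.array([1.0, 2.0, 3.0])
y_low = np.array([0.0, 2.5, 2.0])
y_up = np.array([2.0, 3.5, 4.0])
print(mean_winkler_score(y_true, y_low, y_up, alpha=0.25))  # 3.0
```

Only the second sample misses (2.0 < 2.5), so its width of 1 is inflated by a penalty of (2 / 0.25) × 0.5 = 4.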
2. Calibration Metrics

Expected Calibration Error (ECE)

Measures the difference between predicted confidence levels and actual accuracy across \(M\) confidence bins \(B_m\):

\[\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left|\text{acc}(B_m) - \text{conf}(B_m)\right|\]
where:
- \(\text{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} y_i\) — accuracy within bin \(m\)
- \(\text{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{f}(x_i)\) — average confidence in bin \(m\)
Tip
The lower the ECE, the better the calibration.
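A sketch of the ECE computation in plain NumPy, following the bin definitions above (equal-width bins and half-open bin membership are implementation conventions that vary):

```python
import numpy as np

def ece(y_true, y_prob, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(y_true)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob > lo) & (y_prob <= hi)  # half-open bins (lo, hi]
        if mask.any():
            acc = y_true[mask].mean()    # accuracy within the bin
            conf = y_prob[mask].mean()   # average confidence in the bin
            total += mask.sum() / n * abs(acc - conf)
    return total

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.8, 0.7, 0.3])
print(ece(y_true, y_prob, n_bins=5))  # 0.325
```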
Top-Label ECE

Extends ECE to multi-class settings, focusing on the calibration of the most confident prediction (the top label). An ECE is computed per top label and averaged:

\[\text{Top-Label ECE} = \frac{1}{L} \sum_{j=1}^{L} \sum_{i} \frac{|B_{i,j}|}{n_j} \left|\text{acc}(B_{i,j}) - \text{conf}(B_{i,j})\right|\]
where:
- \(L\) = number of unique labels
- \(B_{i,j}\) = indices in bin \(i\) for label \(j\)
- \(n_j\) = total samples for label \(j\)
Cumulative Differences

Calculates the cumulative differences between true values and prediction scores, after sorting the samples by score [2].
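A sketch of the computation (normalizing the cumulative sum by n is one common convention):

```python
import numpy as np

def cumulative_differences(y_true, y_prob):
    """Cumulative (y - p) differences, with samples sorted by score."""
    order = np.argsort(y_prob)
    return np.cumsum(y_true[order] - y_prob[order]) / len(y_true)

y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.2, 0.9, 0.6, 0.4])
diffs = cumulative_differences(y_true, y_prob)
print(diffs)  # stays near 0 for a well-calibrated model
```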
Kolmogorov-Smirnov Statistic

Tests whether the calibration curve deviates significantly from the ideal diagonal line.
Kuiper Statistic

Similar to KS but captures both positive and negative deviations from perfect calibration.
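Both statistics can be read off the cumulative differences curve; a self-contained sketch (scalings and normalizations vary across implementations):

```python
import numpy as np

# Cumulative calibration differences, samples sorted by score.
y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.2, 0.9, 0.6, 0.4])
order = np.argsort(y_prob)
c = np.cumsum(y_true[order] - y_prob[order]) / len(y_true)

ks = np.abs(c).max()        # KS: largest absolute deviation
kuiper = c.max() - c.min()  # Kuiper: largest up- plus down-deviation
print(ks, kuiper)
```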
Spiegelhalter Statistic

Tests the overall calibration of predicted probabilities.
References

1. Angelopoulos, A. N., et al. "Uncertainty Sets for Image Classifiers using Conformal Prediction." ICLR 2021.
2. Arrieta-Ibarra, I., et al. "Metrics of calibration for probabilistic predictions." JMLR 23(1), 2022.
3. Gretton, A., et al. "A Kernel Two-Sample Test." JMLR, 2012.
4. Khosravi, A., et al. "Comprehensive Review of Neural Network-Based Prediction Intervals." IEEE Trans. Neural Netw., 2011.
5. Winkler, R. L. "A Decision-Theoretic Approach to Interval Estimation." JASA, 1972.