Regression — Theoretical Description¶
Terminology
In theoretical parts of the documentation:
- `alpha` is equivalent to `1 - confidence_level`; it can be seen as a risk level.
- "calibrate" and "calibration" are equivalent to "conformalize" and "conformalization".
The methods in mapie.regression use various resampling methods based on the jackknife strategy introduced by Foygel Barber et al. (2021) 1. They allow the user to estimate robust prediction intervals with any kind of machine learning model for regression purposes on single-output data.
Mathematical Setting¶
For a regression problem in a standard i.i.d. case, our training data \((X, Y) = \{(x_1, y_1), \ldots, (x_n, y_n)\}\) has an unknown distribution \(P_{X, Y}\). We assume that \(Y = \mu(X) + \epsilon\), where \(\mu\) is the model function and \(\epsilon \sim P_{Y \mid X}\) is the noise.
Given a target risk level \(\alpha\), we aim at constructing a prediction interval \(\hat{C}_{n, \alpha}\) such that:

\[P\{Y_{n+1} \in \hat{C}_{n, \alpha}(X_{n+1})\} \geq 1 - \alpha\]

for a new sample \((X_{n+1}, Y_{n+1})\) drawn from the same distribution.
All methods below are described with the absolute residual conformity score for simplicity, but other scores are available (see Conformity Scores).
1. The "Naive" Method¶
The naive method computes the residuals of the training data to estimate the typical error on a new test data point:

\[\hat{C}_{n, \alpha}^{\text{naive}}(X_{n+1}) = [\hat{\mu}(X_{n+1}) - \hat{q}_{n, \alpha}^+\{|Y_i - \hat{\mu}(X_i)|\},\ \hat{\mu}(X_{n+1}) + \hat{q}_{n, \alpha}^+\{|Y_i - \hat{\mu}(X_i)|\}]\]

where \(\hat{q}_{n, \alpha}^+\{\cdot\}\) is the \((1-\alpha)\) empirical quantile of the distribution of training residuals.
Warning
Since this method estimates conformity scores on the training set, it tends to be too optimistic and underestimates the width of prediction intervals due to potential overfitting.
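The naive interval can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, with a least-squares line standing in for \(\hat{\mu}\); it is not MAPIE's implementation.

```python
import numpy as np

# Synthetic 1-D data: y = 2x + Gaussian noise (illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 200)
y = 2.0 * x + rng.normal(0.0, 1.0, 200)

# Fit mu_hat on the full training set (simple linear least squares).
slope, intercept = np.polyfit(x, y, 1)

# Absolute-residual conformity scores, computed on the *training* data.
scores = np.abs(y - (slope * x + intercept))

# q_hat: (1 - alpha) empirical quantile of the scores.
alpha = 0.1
q_hat = np.quantile(scores, 1.0 - alpha)

# Naive prediction interval for a new point.
x_new = 5.0
center = slope * x_new + intercept
lower, upper = center - q_hat, center + q_hat
```

Because the same data are used to fit the model and to compute the scores, `q_hat` is typically too small, which is exactly the undercoverage the warning above describes.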
2. The Split Method¶
The split method computes residuals on a calibration dataset separate from the training set:

\[\hat{C}_{n, \alpha}^{\text{split}}(X_{n+1}) = [\hat{\mu}(X_{n+1}) - \hat{q}_{n, \alpha}^+\{|Y_i - \hat{\mu}(X_i)| : i \in \text{Cal}\},\ \hat{\mu}(X_{n+1}) + \hat{q}_{n, \alpha}^+\{|Y_i - \hat{\mu}(X_i)| : i \in \text{Cal}\}]\]

where \(\text{Cal}\) denotes the calibration set, disjoint from the set used to fit \(\hat{\mu}\).
Info
This method is very similar to the naive one; the only difference is that conformity scores are computed on the calibration set rather than on the training set. One must therefore have enough observations to split the dataset.
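The only change from the naive sketch is the disjoint split; the sketch below also applies the standard finite-sample quantile correction \((1-\alpha)(1 + 1/n_{\text{cal}})\) used by split conformal methods. Again a minimal NumPy illustration on synthetic data, not MAPIE's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 400)
y = 2.0 * x + rng.normal(0.0, 1.0, 400)

# Disjoint train / calibration split.
x_train, x_cal = x[:200], x[200:]
y_train, y_cal = y[:200], y[200:]

# Fit on the training half only.
slope, intercept = np.polyfit(x_train, y_train, 1)

# Conformity scores on the held-out calibration half.
scores = np.abs(y_cal - (slope * x_cal + intercept))

# Finite-sample-corrected (1 - alpha) quantile of the scores.
alpha = 0.1
n_cal = len(scores)
level = min(1.0, (1.0 - alpha) * (n_cal + 1) / n_cal)
q_hat = np.quantile(scores, level)

# Split prediction interval for a new point.
x_new = 5.0
center = slope * x_new + intercept
lower, upper = center - q_hat, center + q_hat
```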
3. The Jackknife Method¶
The standard jackknife method is based on leave-one-out models:
- For each instance \(i = 1, \ldots, n\), fit \(\hat{\mu}_{-i}\) on the training set with the \(i\)-th point removed.
- Compute the leave-one-out conformity score: \(|Y_i - \hat{\mu}_{-i}(X_i)|\).
- Fit \(\hat{\mu}\) on the entire training set and compute the prediction interval:

\[\hat{C}_{n, \alpha}^{\text{jackknife}}(X_{n+1}) = [\hat{\mu}(X_{n+1}) - \hat{q}_{n, \alpha}^+\{R_i^{\text{LOO}}\},\ \hat{\mu}(X_{n+1}) + \hat{q}_{n, \alpha}^+\{R_i^{\text{LOO}}\}]\]

where \(R_i^{\text{LOO}} = |Y_i - \hat{\mu}_{-i}(X_i)|\).
Warning
This method avoids overfitting but can lose its predictive coverage when \(\hat{\mu}\) becomes unstable (e.g., when the sample size is close to the number of features).
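The three steps above can be sketched directly: a leave-one-out loop for the conformity scores, then a full-data fit for the interval center. Synthetic data and a least-squares line as a stand-in model, not MAPIE's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 60)
y = 2.0 * x + rng.normal(0.0, 1.0, 60)
n = len(x)

# Leave-one-out conformity scores R_i = |y_i - mu_{-i}(x_i)|.
r_loo = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    slope_i, intercept_i = np.polyfit(x[mask], y[mask], 1)
    r_loo[i] = abs(y[i] - (slope_i * x[i] + intercept_i))

# Full model fitted on all n points gives the interval center.
slope, intercept = np.polyfit(x, y, 1)

alpha = 0.1
q_hat = np.quantile(r_loo, 1.0 - alpha)

x_new = 5.0
center = slope * x_new + intercept
lower, upper = center - q_hat, center + q_hat
```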
4. The Jackknife+ Method¶
Unlike the standard jackknife, the jackknife+ uses each leave-one-out prediction on the new test point to account for the variability of the fitted model:

\[\hat{C}_{n, \alpha}^{\text{jackknife+}}(X_{n+1}) = [\hat{q}_{n, \alpha}^-\{\hat{\mu}_{-i}(X_{n+1}) - R_i^{\text{LOO}}\},\ \hat{q}_{n, \alpha}^+\{\hat{\mu}_{-i}(X_{n+1}) + R_i^{\text{LOO}}\}]\]

where \(\hat{q}_{n, \alpha}^-\{\cdot\}\) is the \(\alpha\) empirical quantile.
Guarantee
This method guarantees a coverage level of \(1-2\alpha\) for a target of \(1-\alpha\), without any a priori assumption on the data distribution or on the predictive model 1.
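The difference from the standard jackknife is that each leave-one-out model also predicts at the new point, and the interval ends are quantiles of the shifted leave-one-out predictions rather than a symmetric band around one center. A minimal NumPy sketch on synthetic data, not MAPIE's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 60)
y = 2.0 * x + rng.normal(0.0, 1.0, 60)
n = len(x)
x_new = 5.0

loo_pred = np.empty(n)   # mu_{-i}(x_new), one per left-out point
r_loo = np.empty(n)      # R_i = |y_i - mu_{-i}(x_i)|
for i in range(n):
    mask = np.arange(n) != i
    slope_i, intercept_i = np.polyfit(x[mask], y[mask], 1)
    loo_pred[i] = slope_i * x_new + intercept_i
    r_loo[i] = abs(y[i] - (slope_i * x[i] + intercept_i))

alpha = 0.1
# Lower end: alpha-quantile of mu_{-i}(x_new) - R_i;
# upper end: (1 - alpha)-quantile of mu_{-i}(x_new) + R_i.
lower = np.quantile(loo_pred - r_loo, alpha)
upper = np.quantile(loo_pred + r_loo, 1.0 - alpha)
```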
5. The Jackknife-Minmax Method¶
A more conservative alternative uses the min and max of the leave-one-out predictions:

\[\hat{C}_{n, \alpha}^{\text{jackknife-mm}}(X_{n+1}) = [\min_i \hat{\mu}_{-i}(X_{n+1}) - \hat{q}_{n, \alpha}^+\{R_i^{\text{LOO}}\},\ \max_i \hat{\mu}_{-i}(X_{n+1}) + \hat{q}_{n, \alpha}^+\{R_i^{\text{LOO}}\}]\]
Guarantee
This method guarantees a coverage level of at least \(1-\alpha\).
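Only the last two lines change relative to the jackknife+ sketch: the interval is anchored at the min and max of the leave-one-out predictions, which can only widen it. Same synthetic setup and stand-in model as above, not MAPIE's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 60)
y = 2.0 * x + rng.normal(0.0, 1.0, 60)
n = len(x)
x_new = 5.0

loo_pred = np.empty(n)   # mu_{-i}(x_new)
r_loo = np.empty(n)      # R_i = |y_i - mu_{-i}(x_i)|
for i in range(n):
    mask = np.arange(n) != i
    slope_i, intercept_i = np.polyfit(x[mask], y[mask], 1)
    loo_pred[i] = slope_i * x_new + intercept_i
    r_loo[i] = abs(y[i] - (slope_i * x[i] + intercept_i))

alpha = 0.1
q_hat = np.quantile(r_loo, 1.0 - alpha)

# Min/max of the LOO predictions gives a conservative interval.
lower = loo_pred.min() - q_hat
upper = loo_pred.max() + q_hat
```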
Computational Cost
The jackknife methods require fitting as many models as there are training points, which can be prohibitive for large datasets.
6. The CV+ Method¶
To reduce computational time, one can use a cross-validation approach instead of leave-one-out:
- Split the training set into \(K\) disjoint subsets of equal size.
- Fit \(K\) regression functions \(\hat{\mu}_{-S_k}\).
- Compute out-of-fold conformity scores for each point.
- Use the regression functions to estimate prediction intervals.
Guarantee
Like the jackknife+, CV+ guarantees a coverage level \(\geq 1-2\alpha\). The jackknife+ can be viewed as the special case where \(K = n\).
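The four steps above can be sketched as a K-fold version of the jackknife+ loop: each point gets its conformity score and its prediction at the new point from the one model that did not see it. A minimal NumPy illustration on synthetic data, not MAPIE's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 200)
y = 2.0 * x + rng.normal(0.0, 1.0, 200)
n, K = len(x), 5
x_new = 5.0

# K disjoint folds of (roughly) equal size.
folds = np.array_split(rng.permutation(n), K)

oof_pred = np.empty(n)   # out-of-fold prediction at x_new, per point
scores = np.empty(n)     # out-of-fold conformity scores
for fold in folds:
    mask = np.ones(n, dtype=bool)
    mask[fold] = False   # train on the other K-1 folds
    slope_k, intercept_k = np.polyfit(x[mask], y[mask], 1)
    scores[fold] = np.abs(y[fold] - (slope_k * x[fold] + intercept_k))
    oof_pred[fold] = slope_k * x_new + intercept_k

alpha = 0.1
lower = np.quantile(oof_pred - scores, alpha)
upper = np.quantile(oof_pred + scores, 1.0 - alpha)
```

Only K models are fitted instead of n, which is the whole point of the method.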
7. The CV and CV-Minmax Methods¶
By analogy with the standard jackknife and jackknife-minmax, these rely on out-of-fold regression models.
8. The Jackknife+-After-Bootstrap Method¶
This method replaces leave-one-out resampling with bootstrap resampling, reducing computational time and yielding more robust predictions 2:
- Resample the training set with replacement \(K\) times → bootstraps \(B_1, \ldots, B_K\).
- Fit \(K\) regression functions on the bootstraps and compute predictions on complementary sets.
- Aggregate predictions (mean or median) and compute conformity scores.
- Use aggregated predictions to estimate prediction intervals.
Guarantee
This method guarantees a coverage level \(\geq 1-2\alpha\), as for the jackknife+.
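The resample/fit/aggregate steps above can be sketched as follows: each training point is scored against the mean prediction of the models whose bootstrap sample did not contain it. Synthetic data, a least-squares stand-in model, and mean aggregation; not MAPIE's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 200)
y = 2.0 * x + rng.normal(0.0, 1.0, 200)
n, K = len(x), 30
x_new = 5.0

# Fit K models on bootstrap resamples (sampling with replacement).
boot_idx = [rng.integers(0, n, n) for _ in range(K)]
models = [np.polyfit(x[idx], y[idx], 1) for idx in boot_idx]
pred_train = np.array([s * x + c for s, c in models])      # (K, n)
pred_new = np.array([s * x_new + c for s, c in models])    # (K,)

# For each point, aggregate (mean) the predictions of the models
# whose bootstrap did NOT contain it, then score against that mean.
in_boot = np.array([np.isin(np.arange(n), idx) for idx in boot_idx])
oob = ~in_boot                                             # (K, n)
scores = np.full(n, np.nan)
agg_new = np.full(n, np.nan)
for i in range(n):
    keep = oob[:, i]
    if keep.any():
        scores[i] = abs(y[i] - pred_train[keep, i].mean())
        agg_new[i] = pred_new[keep].mean()
valid = ~np.isnan(scores)  # drop the rare points in every bootstrap

alpha = 0.1
lower = np.quantile(agg_new[valid] - scores[valid], alpha)
upper = np.quantile(agg_new[valid] + scores[valid], 1.0 - alpha)
```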
9. Conformalized Quantile Regression (CQR)¶
CQR allows for better interval widths with heteroscedastic data by using quantile regressors:
Formulation¶
The prediction interval for a new sample \(X_{n+1}\) is:

\[\hat{C}_{n, \alpha}(X_{n+1}) = [\hat{q}_{\alpha_{\text{lo}}}(X_{n+1}) - Q_{1-\alpha}(E_{\text{low}}),\ \hat{q}_{\alpha_{\text{hi}}}(X_{n+1}) + Q_{1-\alpha}(E_{\text{high}})]\]

Where:
- \(\hat{q}_{\alpha_{\text{lo}}}\) and \(\hat{q}_{\alpha_{\text{hi}}}\) are the predicted lower and upper quantiles.
- \(E_{\text{low}} = \hat{q}_{\alpha_{\text{lo}}}(X_i) - Y_i\) and \(E_{\text{high}} = Y_i - \hat{q}_{\alpha_{\text{hi}}}(X_i)\) are the residuals of each quantile regressor.
- \(Q_{1-\alpha}(\cdot)\) is the \((1-\alpha)\) empirical quantile of the residuals from the calibration set.
Symmetric variant
In the symmetric method, \(E_{\text{low}}\) and \(E_{\text{high}}\) are merged into \(E_{\text{all}}\), and the quantile is calculated on all absolute residuals.
10. EnbPI (Ensemble Batch Prediction Intervals)¶
For time series where the exchangeability hypothesis does not hold:
Key features:
- Residuals are updated dynamically during prediction when new observations are available.
- Coverage guarantee is asymptotic under two hypotheses:
- Errors are short-term i.i.d.
- Estimation quality converges.
Trade-off
The bigger the training set, the better the coverage guarantee. But if the model is not refitted, larger training sets slow down the residual update.
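The dynamic residual update can be sketched with a rolling buffer: predict an interval from the current residuals, then append the new residual once the observation arrives, without refitting. This is only the update idea, not the full EnbPI ensemble; the oracle point predictor is a stand-in.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
alpha = 0.1

def predict(t):
    # Oracle point predictor standing in for the fitted ensemble.
    return 0.5 * t

# Rolling buffer of the most recent absolute residuals.
residuals = deque(maxlen=100)
residuals.extend(abs(rng.normal(0.0, 1.0)) for _ in range(100))

intervals = []
for t in range(50):
    center = predict(t)
    q_hat = np.quantile(np.array(residuals), 1.0 - alpha)
    intervals.append((center - q_hat, center + q_hat))
    # New observation arrives: update the residuals, no refit.
    y_t = 0.5 * t + rng.normal(0.0, 1.0)
    residuals.append(abs(y_t - center))
```

The `maxlen` of the buffer controls the trade-off described above: a longer buffer gives more stable quantiles but reacts more slowly to a change in the error distribution.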
Key Takeaways¶
| Feature | Recommendation |
|---|---|
| Accurate & robust intervals | Jackknife+ |
| Large datasets | CV+ or Jackknife+-after-bootstrap |
| Conservative estimates | Jackknife-minmax / CV-minmax |
| Heteroscedastic data | CQR |
| Time series | EnbPI |
Method Comparison¶
| Method | Theoretical coverage | Typical coverage | Training cost | Evaluation cost |
|---|---|---|---|---|
| Naïve | No guarantee | \(< 1-\alpha\) | 1 | \(n_{\text{test}}\) |
| Split | \(\geq 1-\alpha\) | \(\simeq 1-\alpha\) | 1 | \(n_{\text{test}}\) |
| Jackknife | No guarantee | \(\simeq 1-\alpha\) | \(n\) | \(n_{\text{test}}\) |
| Jackknife+ | \(\geq 1-2\alpha\) | \(\simeq 1-\alpha\) | \(n\) | \(n \times n_{\text{test}}\) |
| Jackknife-minmax | \(\geq 1-\alpha\) | \(> 1-\alpha\) | \(n\) | \(n \times n_{\text{test}}\) |
| CV | No guarantee | \(\simeq 1-\alpha\) | \(K\) | \(n_{\text{test}}\) |
| CV+ | \(\geq 1-2\alpha\) | \(\gtrsim 1-\alpha\) | \(K\) | \(K \times n_{\text{test}}\) |
| CV-minmax | \(\geq 1-\alpha\) | \(> 1-\alpha\) | \(K\) | \(K \times n_{\text{test}}\) |
| Jackknife-aB+ | \(\geq 1-2\alpha\) | \(\gtrsim 1-\alpha\) | \(K\) | \(K \times n_{\text{test}}\) |
| CQR | \(\geq 1-\alpha\) | \(\gtrsim 1-\alpha\) | 3 | \(3 \times n_{\text{test}}\) |
| EnbPI | \(\geq 1-\alpha\) (asymptotic) | \(\gtrsim 1-\alpha\) | \(K\) | \(K \times n_{\text{test}}\) |
References¶
1. Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. "Predictive inference with the jackknife+." Ann. Statist., 49(1):486–507, 2021.
2. Byol Kim, Chen Xu, and Rina Foygel Barber. "Predictive inference is free with the jackknife+-after-bootstrap." NeurIPS 33 (2020): 4138–4149.