Conformal Prediction in Practice
Prediction intervals that actually hold up. That's the promise of conformal prediction, and after running 12 slots through our V6 Express system across 3 loops, here's what we've learned.
What Is Conformal Prediction?
At its core, conformal prediction wraps any point-prediction model with a calibration layer that produces prediction sets (classification) or prediction intervals (regression) with a guaranteed coverage rate.
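The key guarantee: if you ask for 90% coverage, then on average at least 90% of the prediction sets built for future data contain the true value, provided the calibration data and the future data are exchangeable. The guarantee is marginal: it holds averaged over all inputs, not separately for every subgroup or individual prediction.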
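Stated a little more formally, and using notation introduced here for illustration only ($\hat{C}_\alpha$ for the prediction set at miscoverage level $\alpha$, $n$ for the number of calibration points), the standard split-conformal result reads as follows. The upper bound additionally assumes the nonconformity scores have no ties.

```latex
\[
1 - \alpha \;\le\; \mathbb{P}\bigl(Y_{n+1} \in \hat{C}_\alpha(X_{n+1})\bigr) \;\le\; 1 - \alpha + \frac{1}{n+1}
\]
```

The probability is taken over both the calibration draw and the new point, which is why a single calibration split can land somewhat above or below the nominal level.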
Three Flavors We Run
StandardCP
The workhorse. Split your data into training and calibration sets. Fit the model on the training set, compute nonconformity scores on the calibration set, and use a quantile of those scores to set the interval width.
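# Simplified StandardCP (split conformal, absolute-residual score)
import numpy as np
scores = np.abs(y_cal - model.predict(X_cal))
n = len(scores)
# Finite-sample correction: use the ceil((n + 1) * (1 - alpha)) / n quantile
q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n), method="higher")
pred = model.predict(X_test)  # X_test: new inputs (name assumed here)
intervals = np.column_stack([pred - q, pred + q])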
Works well for regression and, with a suitable nonconformity score, for binary classification. One caveat: our BTC direction slot currently reaches 100% coverage, which is overcoverage rather than a win (see BUG-13).
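For the binary case, the same recipe yields prediction sets rather than intervals. The sketch below is a generic illustration, not our slot code: the function name `binary_prediction_sets`, the score (one minus the probability assigned to the true label), and the synthetic near-chance classifier are all assumptions made for the example.

```python
import numpy as np

def binary_prediction_sets(p_cal, y_cal, p_test, alpha=0.1):
    """Split-conformal prediction sets for a binary classifier.

    p_cal, p_test: predicted probability of class 1; y_cal: labels in {0, 1}.
    Returns a boolean array of shape (n_test, 2); column k is True when
    label k is in the prediction set.
    """
    p_cal = np.asarray(p_cal, dtype=float)
    y_cal = np.asarray(y_cal, dtype=int)
    p_test = np.asarray(p_test, dtype=float)
    # Nonconformity score: 1 - probability assigned to the true label.
    scores = np.where(y_cal == 1, 1.0 - p_cal, p_cal)
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    include_0 = p_test <= q            # score of label 0 is p_test
    include_1 = (1.0 - p_test) <= q    # score of label 1 is 1 - p_test
    return np.column_stack([include_0, include_1])

# Synthetic check with a deliberately weak (near-chance) classifier.
rng = np.random.default_rng(0)
p = rng.uniform(0.4, 0.6, size=2000)
y = (rng.uniform(size=2000) < p).astype(int)
sets = binary_prediction_sets(p[:1000], y[:1000], p[1000:], alpha=0.1)
covered = sets[np.arange(1000), y[1000:]]
both = sets.all(axis=1)
print(f"coverage={covered.mean():.3f}, two-label sets={both.mean():.3f}")
```

With a classifier this weak, most sets contain both labels, so the coverage target is met largely by being uninformative. That is worth checking for whenever a binary slot reports very high coverage.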
MondrianCP
Stratified conformal prediction. Instead of one global quantile, compute separate quantiles per group. Critical for multiclass and clustering slots where class-conditional coverage matters.
Our fix for BUG-H: when groups are too small, we fall back to equi-split temporal groups (2-3 groups) rather than degenerating to a single group.
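A minimal sketch of per-group quantiles with a temporal fallback is shown below. It is an illustration under stated assumptions rather than our production code: the names `mondrian_quantiles`, `min_group_size`, and `n_fallback`, the threshold of 30, and the assumption that scores arrive in time order are all hypothetical.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of nonconformity scores."""
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(scores, level, method="higher"))

def mondrian_quantiles(scores, groups, alpha=0.1, min_group_size=30, n_fallback=3):
    """Return (group label per calibration point, {label: quantile}).

    If any group has fewer than min_group_size points, fall back to
    n_fallback contiguous, roughly equal-sized temporal blocks.
    Scores are assumed to be in time order.
    """
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    labels, counts = np.unique(groups, return_counts=True)
    if counts.min() < min_group_size:
        # Fallback: replace the original grouping with temporal blocks.
        blocks = np.array_split(np.arange(len(scores)), n_fallback)
        groups = np.empty(len(scores), dtype=int)
        for k, idx in enumerate(blocks):
            groups[idx] = k
        labels = np.arange(n_fallback)
    quantiles = {
        lab: conformal_quantile(scores[groups == lab], alpha)
        for lab in labels.tolist()
    }
    return groups, quantiles

rng = np.random.default_rng(1)
scores = np.abs(rng.normal(size=90))
g_ok = np.array(["a"] * 45 + ["b"] * 45)      # both groups large enough
g_small = np.array(["a"] * 85 + ["b"] * 5)    # group "b" is too small
_, q_ok = mondrian_quantiles(scores, g_ok)
labels_fb, q_fb = mondrian_quantiles(scores, g_small)
print(sorted(q_ok), sorted(q_fb))
```

At prediction time, each test point would use the quantile of the group (or temporal block) it belongs to.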
CQR (Conformalized Quantile Regression)
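Trains quantile regression models at the alpha/2 and 1 - alpha/2 levels, then uses the calibration set to measure how far the true values fall outside the predicted band and shifts both bounds by the corresponding quantile. Produces adaptive intervals: wider where the model is uncertain.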
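The calibration step can be sketched independently of the quantile model. The example below assumes the lower and upper quantile predictions are already available as arrays; `cqr_intervals` is a hypothetical name, and the synthetic band is a stand-in for a real quantile regressor, not our setup.

```python
import numpy as np

def cqr_intervals(lo_cal, hi_cal, y_cal, lo_test, hi_test, alpha=0.1):
    """Conformalize a predicted quantile band [lo, hi]."""
    lo_cal, hi_cal, y_cal = (
        np.asarray(a, dtype=float) for a in (lo_cal, hi_cal, y_cal)
    )
    # CQR score: distance by which y falls outside the band
    # (negative when y lies inside it).
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    lower = np.asarray(lo_test, dtype=float) - q
    upper = np.asarray(hi_test, dtype=float) + q
    return lower, upper

# Synthetic heteroscedastic data: noise grows with x.
rng = np.random.default_rng(7)
x = rng.uniform(0.5, 2.0, size=4000)
y = x + x * rng.normal(size=4000)
band = 1.0 * x                       # stand-in band that is too narrow
lo, hi = x - band, x + band
cal, test = slice(0, 2000), slice(2000, 4000)
lower, upper = cqr_intervals(lo[cal], hi[cal], y[cal], lo[test], hi[test])
coverage = np.mean((y[test] >= lower) & (y[test] <= upper))
print(f"coverage={coverage:.3f}")
```

Because the correction `q` is a single constant added to both bounds, the intervals keep the input-dependent shape of the underlying quantile band.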
In practice, our conformal_select module still picks StandardCP over CQR because CQR's p_opt (0.420) is much lower than StandardCP's (0.751). More work needed here.
Coverage Results: Sprint 14, Loop 3
| Slot | Type | Coverage | Target | Status |
|---|---|---|---|---|
| oil-price-regression-3m | regression | 0.860 | [0.85, 0.95] | on_target |
| m2-growth-regression-quarterly-8m | regression | 0.929 | [0.85, 0.95] | on_target |
| cpi-hybrid-quarterly-12m | hybrid | 0.887 | [0.85, 0.95] | on_target |
| btc-direction-12m | binary | 1.000 | [0.85, 0.95] | BUG-13 |
Three slots hit target. The binary overcovering problem (BUG-13) remains open — we need to lower the coverage target for binary slots.
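For reference, the Coverage and Status columns can be reproduced from test-set intervals roughly as follows. This is a sketch, not our reporting code: `coverage_status` is a hypothetical helper, and the labels `undercovering` and `overcovering` are placeholders (the table above uses `on_target` or a bug ID).

```python
import numpy as np

def coverage_status(y, lower, upper, target=(0.85, 0.95)):
    """Empirical coverage of [lower, upper] and its position against a target band."""
    y, lower, upper = (np.asarray(a, dtype=float) for a in (y, lower, upper))
    coverage = float(np.mean((y >= lower) & (y <= upper)))
    if coverage < target[0]:
        status = "undercovering"   # placeholder label
    elif coverage > target[1]:
        status = "overcovering"    # placeholder label
    else:
        status = "on_target"
    return coverage, status

y = np.arange(10, dtype=float)
# Intervals that always contain y.
cov_all, status_all = coverage_status(y, y - 1.0, y + 1.0)
# Intervals that miss exactly one of the ten points.
upper = y + 1.0
upper[0] = -5.0
cov_nine, status_nine = coverage_status(y, y - 1.0, upper)
print(cov_all, status_all, cov_nine, status_nine)
```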
Lessons Learned
- Exchangeability matters. Regime-switching slots (m2-growth-regime, vix-regime) show a persistent gap between calibration coverage and test coverage because the data isn't exchangeable across regimes.
- Coverage oscillates. CPI-hybrid coverage bounced between 0.762 and 0.887 across iterations as the LLM pipeline designer rotated feature sets.
- Start with StandardCP. It's simple, fast, and provides the strongest baseline. Only reach for MondrianCP when you need group-conditional guarantees.
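The exchangeability point is easy to reproduce on synthetic data. The simulation below is a toy illustration with made-up Gaussian residuals, not our slot data: it calibrates on a calm regime and then evaluates coverage on the same regime and on a more volatile one.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = 0.1

def abs_residual_quantile(residuals, alpha):
    """Finite-sample-corrected quantile of absolute residuals."""
    scores = np.abs(residuals)
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

# Calibrate on residuals from a calm regime (noise scale 1.0).
q = abs_residual_quantile(rng.normal(scale=1.0, size=2000), alpha)

# Evaluate on the same regime and on a more volatile regime (scale 2.0).
same = np.abs(rng.normal(scale=1.0, size=5000))
shifted = np.abs(rng.normal(scale=2.0, size=5000))
cov_same = np.mean(same <= q)
cov_shifted = np.mean(shifted <= q)
print(f"same regime: {cov_same:.3f}, shifted regime: {cov_shifted:.3f}")
```

Coverage in the matching regime should sit near the 90% target, while coverage in the shifted regime falls well below it, which mirrors the calibration-versus-test gap described above.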
What's Next
We're working on routing logic that selects the conformal method based on slot_problem_type, and investigating why CQR underperforms in our system despite theoretical advantages for heteroscedastic data.
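As a rough idea of what such routing could look like, here is a hypothetical sketch. Only the method names and `slot_problem_type` come from this post; the `ROUTES` table, the function name, and the choice of StandardCP as the default are assumptions based on the guidance above, not the design we will necessarily ship.

```python
from typing import Dict

# Hypothetical routing table: problem type -> conformal method.
ROUTES: Dict[str, str] = {
    "regression": "StandardCP",
    "binary": "StandardCP",
    "multiclass": "MondrianCP",
    "clustering": "MondrianCP",
}

def select_conformal_method(slot_problem_type: str, default: str = "StandardCP") -> str:
    """Pick a conformal method name; unknown types fall back to the default."""
    return ROUTES.get(slot_problem_type.strip().lower(), default)

print(select_conformal_method("multiclass"))
print(select_conformal_method("hybrid"))
```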
Published from MacroSynchronicity Labs — Facility 4.2