Conformal Prediction in Practice

Prediction intervals that actually hold up. That's the promise of conformal prediction, and after running 12 slots through our V6 Express system across 3 loops, here's what we've learned.

$ coverage_report --sprint 14 --loop 3 --all-slots
on_target:      3 / 12   OK
overcovering:   1 / 12   BUG-13
undercovering:  8 / 12   action required
binary slots:     overcovering at 1.00
regression slots: partial on-target at 0.86–0.93
regime slots:     persistent cal/test gap (non-exchangeable)

What Is Conformal Prediction?

At its core, conformal prediction wraps any point-prediction model with a calibration layer that produces prediction sets (classification) or prediction intervals (regression) with a guaranteed coverage rate.

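The key guarantee: if you ask for 90% coverage, the prediction sets contain the true value at least 90% of the time on future data, provided the calibration data and the future data are exchangeable. The guarantee is marginal: it holds on average over all inputs, not separately for each input or subgroup.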
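Stated a little more formally, this is the standard split-conformal result. The notation here is our own shorthand, not something taken from the V6 Express codebase: n is the number of calibration points, alpha is the miscoverage level (0.1 for 90% coverage), and C-hat is the prediction set built from the calibration scores. The probability is taken over both the calibration draw and the new point, and the upper bound additionally assumes the nonconformity scores have no ties.

```latex
\[
  1 - \alpha
  \;\le\;
  \mathbb{P}\bigl( Y_{n+1} \in \hat{C}_{\alpha}(X_{n+1}) \bigr)
  \;\le\;
  1 - \alpha + \frac{1}{n+1}
\]
```

With a few hundred calibration points the two bounds are close together, so a slot sitting at 1.00 coverage points to something outside the theory (ties, discrete scores, or a bug) rather than ordinary sampling noise.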

Three Flavors We Run

StandardCP

The workhorse. Split your data into training and calibration sets. Train on training, compute nonconformity scores on calibration, use quantiles to set interval widths.

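# Simplified StandardCP (split conformal, absolute-residual score)
import numpy as np
scores = np.abs(y_cal - model.predict(X_cal))
n = len(scores)
# Finite-sample correction: use the ceil((n + 1)(1 - alpha)) / n quantile, not 1 - alpha
q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n), method="higher")
pred = model.predict(X_test)  # X_test: the new inputs to score
intervals = np.column_stack([pred - q, pred + q])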

This is our default for binary classification and regression. On the BTC direction slot it reaches 100% coverage, which is overcovering relative to the [0.85, 0.95] target band (see BUG-13).
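To make the StandardCP recipe concrete, here is a minimal dependency-free sketch of the same steps on synthetic data. Everything in it is an assumption for illustration: the `predict` function stands in for a fitted model, the data generator is made up, and `conformal_quantile` is a hypothetical helper name rather than anything from our codebase.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1)(1 - alpha))-th smallest score, or inf if n is too small."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else sorted(scores)[rank - 1]


def predict(x):
    """Hypothetical stand-in for model.predict on a single input."""
    return 2.0 * x


rng = random.Random(0)


def draw(n):
    """Synthetic pairs y = 2x + Gaussian noise (illustrative data only)."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 0.5)) for x in xs]


alpha = 0.1
cal, test = draw(500), draw(2000)

# Nonconformity score: absolute residual on the calibration set.
scores = [abs(y - predict(x)) for x, y in cal]
q = conformal_quantile(scores, alpha)

# Symmetric interval of half-width q around each test prediction.
intervals = [(predict(x) - q, predict(x) + q) for x, _ in test]
coverage = sum(lo <= y <= hi for (lo, hi), (_, y) in zip(intervals, test)) / len(test)
print(f"q = {q:.3f}, empirical test coverage = {coverage:.3f}")
```

Because calibration and test points come from the same generator here, the empirical coverage should land near the 90% request. Every interval has the same width, which is the limitation CQR tries to address.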

$ audit --bug-13 --symptom binary_overcovering
affected_slot: btc-direction-12m
coverage:      1.000 → interval too wide
target_band:   [0.85, 0.95]
hypothesis:    binary nonconformity score distribution is bimodal; global quantile clips to the outer mode.
remediation:   per-class quantile (MondrianCP) OR lower coverage_target to 0.88.

MondrianCP

Stratified conformal prediction. Instead of one global quantile, compute separate quantiles per group. Critical for multiclass and clustering slots where class-conditional coverage matters.

Our fix for BUG-H: when groups are too small, we fall back to equi-split temporal groups (2-3 groups) rather than degenerating to a single group.

$ audit --bug-h --symptom regime_instability
affected_slots:    vix-regime-multiclass-3m, yield-curve-regime-quarterly-12m
coverage_std:      0.20+ non-exchangeable
per_group_samples: < 5 in multiple strata
fix_path:          fall back to pooled StandardCP when n_g < ceil(1/alpha)-1
tag:               SCAN30-S30-2
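The per-group quantile idea, together with the pooled fallback named in the audit's fix_path, can be sketched as follows. This is an illustration under assumptions, not our production MondrianCP: the function names and toy scores are invented, and it implements only the pooled fallback, not the equi-split temporal grouping. The threshold has a simple reading: with fewer than ceil(1/alpha) - 1 calibration points in a group, the finite-sample rank exceeds the group size and the per-group quantile would be infinite.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1)(1 - alpha))-th smallest score, or inf if n is too small."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else sorted(scores)[rank - 1]


def mondrian_quantiles(scores, groups, alpha):
    """Per-group conformal quantiles, using the pooled quantile for undersized groups."""
    min_n = math.ceil(1 / alpha) - 1  # threshold quoted in the BUG-H audit
    pooled_q = conformal_quantile(scores, alpha)
    by_group = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    return {
        group: conformal_quantile(vals, alpha) if len(vals) >= min_n else pooled_q
        for group, vals in by_group.items()
    }


# Toy calibration scores (made up): a well-populated calm regime and a sparse stressed one.
scores = [0.1 * i for i in range(1, 21)] + [3.0, 3.5, 4.0]
groups = ["calm"] * 20 + ["stress"] * 3

q_by_group = mondrian_quantiles(scores, groups, alpha=0.1)
print(q_by_group)
```

The calm group gets its own, tighter quantile, while the three-sample stress group inherits the pooled one. The trade-off is that a group using the pooled quantile no longer has a group-conditional guarantee, only the marginal one.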

CQR (Conformalized Quantile Regression)

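Trains quantile regression models at the alpha/2 and 1-alpha/2 levels, then uses the calibration set to measure how far the true values fall outside the predicted band and shifts both bounds by a quantile of those errors. Produces adaptive intervals that are wider where the model is uncertain.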
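The calibration step is the part that differs from StandardCP, so here is a minimal sketch of it in isolation. It assumes the lower and upper quantile predictions already exist (the lists below are made-up numbers, not output from our quantile models), and `cqr_calibrate` and `cqr_interval` are hypothetical names.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1)(1 - alpha))-th smallest score, or inf if n is too small."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else sorted(scores)[rank - 1]


def cqr_calibrate(lo_cal, hi_cal, y_cal, alpha):
    """Score each point by how far y lies outside [lo, hi]; negative means inside."""
    scores = [max(lo - y, y - hi) for lo, hi, y in zip(lo_cal, hi_cal, y_cal)]
    return conformal_quantile(scores, alpha)


def cqr_interval(lo, hi, q):
    """Shift both quantile predictions outward by q (inward if q is negative)."""
    return lo - q, hi + q


# Made-up calibration data: the quantile models predict the band [0, 1] everywhere.
lo_cal = [0.0] * 10
hi_cal = [1.0] * 10
y_cal = [0.5, 0.2, 0.9, 1.1, -0.1, 0.4, 1.3, 0.6, 0.8, 0.3]

q = cqr_calibrate(lo_cal, hi_cal, y_cal, alpha=0.2)

# Two new points whose raw bands differ in width; the correction q is shared.
narrow = cqr_interval(2.0, 3.0, q)
wide = cqr_interval(2.0, 5.0, q)
print(q, narrow, wide)
```

The adaptivity comes entirely from the quantile models: the conformal step adds the same q to every band, so a poorly fitted quantile model still yields valid but uninformative intervals.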

In practice, our conformal_select module still picks StandardCP over CQR because CQR's p_opt (0.420) is much lower than StandardCP's (0.751). More work needed here.

Coverage Results: Sprint 14, Loop 3

$ coverage_report --sprint 14 --loop 3 --format table

Slot	Type	Coverage	Target	Status
oil-price-regression-3m	regression	0.860	[0.85, 0.95]	on_target
m2-growth-regression-quarterly-8m	regression	0.929	[0.85, 0.95]	on_target
cpi-hybrid-quarterly-12m	hybrid	0.887	[0.85, 0.95]	on_target
btc-direction-12m	binary	1.000	[0.85, 0.95]	BUG-13

Three slots hit target. The binary overcovering problem (BUG-13) remains open — we need to lower the coverage target for binary slots.
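For readers who want to reproduce the Status column, the check amounts to comparing empirical coverage against the target band. The sketch below is our reading of that logic, not the coverage_report source: the function names are hypothetical, and treating the band edges as inclusive is an assumption. The status labels and coverage values are the ones in the table above.

```python
def empirical_coverage(intervals, y_true):
    """Fraction of true values that fall inside their (lo, hi) interval."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, y_true))
    return hits / len(y_true)


def coverage_status(coverage, band=(0.85, 0.95)):
    """Classify a coverage value against the target band (edges assumed inclusive)."""
    lo, hi = band
    if coverage < lo:
        return "undercovering"
    if coverage > hi:
        return "overcovering"
    return "on_target"


# Coverage values copied from the Sprint 14, Loop 3 table.
reported = {
    "oil-price-regression-3m": 0.860,
    "m2-growth-regression-quarterly-8m": 0.929,
    "cpi-hybrid-quarterly-12m": 0.887,
    "btc-direction-12m": 1.000,
}
statuses = {slot: coverage_status(cov) for slot, cov in reported.items()}
for slot, status in statuses.items():
    print(f"{slot}\t{status}")
```

Note that the report tags the BTC slot as BUG-13 rather than with a generic overcovering label; this sketch only reproduces the band comparison.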

Lessons Learned

Exchangeability matters. Regime-switching slots (m2-growth-regime, vix-regime) show a persistent gap between calibration coverage and test coverage because the data isn't exchangeable across regimes.
Coverage oscillates. CPI-hybrid coverage bounced between 0.762 and 0.887 across iterations as the LLM pipeline designer rotated feature sets.
Start with StandardCP. It's simple, fast, and provides the strongest baseline. Only reach for MondrianCP when you need group-conditional guarantees.

What's Next

We're working on routing logic that selects the conformal method based on slot_problem_type, and investigating why CQR underperforms in our system despite theoretical advantages for heteroscedastic data.

$ conformal_select --route-table --summary
StandardCP: p_opt 0.751  production default
MondrianCP: p_opt 0.612  regime slots only
CQR:        p_opt 0.420  deprioritized pending investigation
WeightedCP: p_opt 0.684  binary w/ importance weights

"The signal was always there. You were just reading the wrong series."
-- Floor 7 Archives

Published from MacroSynchronicity Labs — Facility 4.2