Can a calibration metric be both testable and actionable?

Raphael Rossellini, Jake A. Soloff, Rina Foygel Barber, Zhimei Ren, Rebecca Willett
Proceedings of Thirty Eighth Conference on Learning Theory, PMLR 291:4937-4972, 2025.

Abstract

Forecast probabilities often serve as critical inputs for binary decision making. In such settings, calibration—ensuring forecasted probabilities match empirical frequencies—is essential. Although the common notion of Expected Calibration Error (ECE) provides actionable insights for decision making, it is not testable: it cannot be empirically estimated in many practical cases. Conversely, the recently proposed Distance from Calibration (dCE) is testable, but it is not actionable since it lacks decision-theoretic guarantees needed for high-stakes applications. To resolve this question, we consider Cutoff Calibration Error, a calibration measure that bridges this gap by assessing calibration over intervals of forecasted probabilities. We show that Cutoff Calibration Error is both testable and actionable, and we examine its implications for popular post-hoc calibration methods, such as isotonic regression and Platt scaling.
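To make the contrast concrete, one way to formalize the three measures for a forecaster f of a binary outcome Y given features X is sketched below. The ECE and dCE expressions are the usual population definitions; the interval form of Cutoff Calibration Error is inferred from the abstract's description ("assessing calibration over intervals of forecasted probabilities") and is not necessarily the paper's exact statement:

\[
\mathrm{ECE}(f) = \mathbb{E}\bigl[\bigl|\mathbb{E}[Y \mid f(X)] - f(X)\bigr|\bigr],
\qquad
\mathrm{dCE}(f) = \inf_{g \ \text{perfectly calibrated}} \mathbb{E}\bigl[|f(X) - g(X)|\bigr],
\]
\[
\mathrm{CutoffCE}(f) = \sup_{0 \le a \le b \le 1} \Bigl|\mathbb{E}\bigl[(Y - f(X))\,\mathbf{1}\{a \le f(X) \le b\}\bigr]\Bigr|.
\]

Because ECE conditions on exact forecast values, it resists empirical estimation when f(X) is continuous; averaging residuals over intervals, by contrast, admits a simple plug-in estimate, which is the intuition behind testability. Under the assumed interval definition above, a minimal Python sketch (the function name and the prefix-sum reduction are ours, not the paper's):

import numpy as np

def cutoff_calibration_error(probs, labels):
    """Plug-in estimate of sup over [a, b] of |mean((y - p) * 1{a <= p <= b})|.

    After sorting by forecast, every interval [a, b] selects a contiguous
    run of samples, so the supremum equals the largest absolute partial
    sum of residuals over contiguous runs: max prefix sum minus min
    prefix sum. Ties in probs are split arbitrarily here, a minor
    discrepancy acceptable for a sketch.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    order = np.argsort(probs)
    resid = (labels - probs)[order] / len(probs)
    prefix = np.concatenate(([0.0], np.cumsum(resid)))
    return float(prefix.max() - prefix.min())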

Cite this Paper


BibTeX
@InProceedings{pmlr-v291-rossellini25a,
  title     = {Can a calibration metric be both testable and actionable?},
  author    = {Rossellini, Raphael and Soloff, Jake A. and Barber, Rina Foygel and Ren, Zhimei and Willett, Rebecca},
  booktitle = {Proceedings of Thirty Eighth Conference on Learning Theory},
  pages     = {4937--4972},
  year      = {2025},
  editor    = {Haghtalab, Nika and Moitra, Ankur},
  volume    = {291},
  series    = {Proceedings of Machine Learning Research},
  month     = {30 Jun--04 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v291/main/assets/rossellini25a/rossellini25a.pdf},
  url       = {https://proceedings.mlr.press/v291/rossellini25a.html},
  abstract  = {Forecast probabilities often serve as critical inputs for binary decision making. In such settings, calibration—ensuring forecasted probabilities match empirical frequencies—is essential. Although the common notion of Expected Calibration Error (ECE) provides actionable insights for decision making, it is not testable: it cannot be empirically estimated in many practical cases. Conversely, the recently proposed Distance from Calibration (dCE) is testable, but it is not actionable since it lacks decision-theoretic guarantees needed for high-stakes applications. To resolve this question, we consider Cutoff Calibration Error, a calibration measure that bridges this gap by assessing calibration over intervals of forecasted probabilities. We show that Cutoff Calibration Error is both testable and actionable, and we examine its implications for popular post-hoc calibration methods, such as isotonic regression and Platt scaling.}
}
EndNote
%0 Conference Paper
%T Can a calibration metric be both testable and actionable?
%A Raphael Rossellini
%A Jake A. Soloff
%A Rina Foygel Barber
%A Zhimei Ren
%A Rebecca Willett
%B Proceedings of Thirty Eighth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2025
%E Nika Haghtalab
%E Ankur Moitra
%F pmlr-v291-rossellini25a
%I PMLR
%P 4937--4972
%U https://proceedings.mlr.press/v291/rossellini25a.html
%V 291
%X Forecast probabilities often serve as critical inputs for binary decision making. In such settings, calibration—ensuring forecasted probabilities match empirical frequencies—is essential. Although the common notion of Expected Calibration Error (ECE) provides actionable insights for decision making, it is not testable: it cannot be empirically estimated in many practical cases. Conversely, the recently proposed Distance from Calibration (dCE) is testable, but it is not actionable since it lacks decision-theoretic guarantees needed for high-stakes applications. To resolve this question, we consider Cutoff Calibration Error, a calibration measure that bridges this gap by assessing calibration over intervals of forecasted probabilities. We show that Cutoff Calibration Error is both testable and actionable, and we examine its implications for popular post-hoc calibration methods, such as isotonic regression and Platt scaling.
APA
Rossellini, R., Soloff, J.A., Barber, R.F., Ren, Z. & Willett, R. (2025). Can a calibration metric be both testable and actionable? Proceedings of Thirty Eighth Conference on Learning Theory, in Proceedings of Machine Learning Research 291:4937-4972. Available from https://proceedings.mlr.press/v291/rossellini25a.html.
