How Flawed Is ECE? An Analysis via Logit Smoothing

Muthu Chidambaram, Holden Lee, Colin Mcswiggen, Semon Rezchikov
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:8417-8435, 2024.

Abstract

Informally, a model is calibrated if its predictions are correct with a probability that matches the confidence of the prediction. By far the most common method in the literature for measuring calibration is the expected calibration error (ECE). Recent work, however, has pointed out drawbacks of ECE, such as the fact that it is discontinuous in the space of predictors. In this work, we ask: how fundamental are these issues, and what are their impacts on existing results? Towards this end, we completely characterize the discontinuities of ECE with respect to general probability measures on Polish spaces. We then use the nature of these discontinuities to motivate a novel continuous, easily estimated miscalibration metric, which we term Logit-Smoothed ECE (LS-ECE). By comparing the ECE and LS-ECE of pre-trained image classification models, we show in initial experiments that binned ECE closely tracks LS-ECE, indicating that the theoretical pathologies of ECE may be avoidable in practice.
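The abstract's central quantity, binned ECE, partitions predictions by confidence and compares each bin's empirical accuracy to its average confidence, weighting by bin size. As a rough illustration only (this is not the authors' code; the function name, the bin count, and the equal-width binning scheme are assumptions), a minimal NumPy sketch:

```python
import numpy as np

def binned_ece(confidences, correct, n_bins=15):
    """Binned expected calibration error: the bin-size-weighted average of
    |accuracy - mean confidence| over equal-width confidence bins.
    Illustrative sketch only, not the paper's estimator."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # half-open bins (lo, hi]; the first bin also includes confidence == 0
        mask = (confidences > lo) & (confidences <= hi)
        if i == 0:
            mask |= confidences == lo
        if mask.sum() == 0:
            continue
        acc = correct[mask].mean()    # empirical accuracy in the bin
        conf = confidences[mask].mean()  # average confidence in the bin
        ece += (mask.sum() / n) * abs(acc - conf)
    return ece

# Ten predictions at 0.9 confidence, nine of them correct: well calibrated.
print(binned_ece([0.9] * 10, [1] * 9 + [0]))  # ~0.0
# Five predictions at 0.8 confidence, all wrong: badly calibrated.
print(binned_ece([0.8] * 5, [0] * 5))  # ~0.8
```

LS-ECE, as the name suggests, instead smooths the logits before measuring calibration; the paper gives its precise definition and estimator, which this sketch does not attempt to reproduce.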

Cite this Paper
BibTeX
@InProceedings{pmlr-v235-chidambaram24a,
  title     = {How Flawed Is {ECE}? {A}n Analysis via Logit Smoothing},
  author    = {Chidambaram, Muthu and Lee, Holden and Mcswiggen, Colin and Rezchikov, Semon},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {8417--8435},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/chidambaram24a/chidambaram24a.pdf},
  url       = {https://proceedings.mlr.press/v235/chidambaram24a.html},
  abstract  = {Informally, a model is calibrated if its predictions are correct with a probability that matches the confidence of the prediction. By far the most common method in the literature for measuring calibration is the expected calibration error (ECE). Recent work, however, has pointed out drawbacks of ECE, such as the fact that it is discontinuous in the space of predictors. In this work, we ask: how fundamental are these issues, and what are their impacts on existing results? Towards this end, we completely characterize the discontinuities of ECE with respect to general probability measures on Polish spaces. We then use the nature of these discontinuities to motivate a novel continuous, easily estimated miscalibration metric, which we term Logit-Smoothed ECE (LS-ECE). By comparing the ECE and LS-ECE of pre-trained image classification models, we show in initial experiments that binned ECE closely tracks LS-ECE, indicating that the theoretical pathologies of ECE may be avoidable in practice.}
}
Endnote
%0 Conference Paper
%T How Flawed Is ECE? An Analysis via Logit Smoothing
%A Muthu Chidambaram
%A Holden Lee
%A Colin Mcswiggen
%A Semon Rezchikov
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-chidambaram24a
%I PMLR
%P 8417--8435
%U https://proceedings.mlr.press/v235/chidambaram24a.html
%V 235
%X Informally, a model is calibrated if its predictions are correct with a probability that matches the confidence of the prediction. By far the most common method in the literature for measuring calibration is the expected calibration error (ECE). Recent work, however, has pointed out drawbacks of ECE, such as the fact that it is discontinuous in the space of predictors. In this work, we ask: how fundamental are these issues, and what are their impacts on existing results? Towards this end, we completely characterize the discontinuities of ECE with respect to general probability measures on Polish spaces. We then use the nature of these discontinuities to motivate a novel continuous, easily estimated miscalibration metric, which we term Logit-Smoothed ECE (LS-ECE). By comparing the ECE and LS-ECE of pre-trained image classification models, we show in initial experiments that binned ECE closely tracks LS-ECE, indicating that the theoretical pathologies of ECE may be avoidable in practice.
APA
Chidambaram, M., Lee, H., Mcswiggen, C. & Rezchikov, S. (2024). How Flawed Is ECE? An Analysis via Logit Smoothing. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:8417-8435. Available from https://proceedings.mlr.press/v235/chidambaram24a.html.