A Calibration Metric for Risk Scores with Survival Data
Proceedings of the 4th Machine Learning for Healthcare Conference, PMLR 106:424-450, 2019.
We study methods for assessing the degree of systematic over- or under-estimation, known as calibration, of a learned risk model in an independent validation cohort. Here, we advance methods for evaluating clinical risk prediction models by deriving a population parameter that measures the average calibration error of the predicted risk relative to the true risk, and by providing a method for its estimation and inference. Our approach improves upon commonly used goodness-of-fit tests, which depend on subjective bin thresholding and may yield misleading results: instead of a single P-value that conflates calibration error with sample size, we report confidence intervals for the calibration error itself. This approach enables comparison among multiple risk prediction models and can guide model revision. We illustrate how our new method helps to understand the calibration of risk models that have been profoundly influential in clinical practice, but controversial due to their potential miscalibration.
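To make the ideas concrete, the sketch below contrasts the two reporting styles the abstract discusses: a binned estimate of average calibration error (predicted risk vs. observed event rate) together with a bootstrap confidence interval for that error, rather than a single P-value. This is a hypothetical illustration on synthetic binary outcomes; it is not the paper's estimator, the binning is exactly the subjective choice the paper avoids, and it ignores the censoring that survival data would require.

```python
import numpy as np

rng = np.random.default_rng(0)

def calibration_error(pred_risk, outcome, n_bins=10):
    """Average absolute gap between mean predicted risk and observed
    event rate within quantile bins. A simple binned proxy for
    illustration only (not the paper's method; ignores censoring)."""
    order = np.argsort(pred_risk)
    bins = np.array_split(order, n_bins)
    gaps = [abs(pred_risk[b].mean() - outcome[b].mean()) for b in bins]
    return float(np.mean(gaps))

def bootstrap_ci(pred_risk, outcome, n_boot=500, alpha=0.05):
    """Percentile-bootstrap confidence interval for the calibration
    error -- reported instead of a bare goodness-of-fit P-value."""
    n = len(outcome)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample the validation cohort
        stats.append(calibration_error(pred_risk[idx], outcome[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Synthetic validation cohort in which the model systematically
# overestimates risk by a factor of 1.5 (hypothetical data).
n = 2000
true_risk = rng.uniform(0.05, 0.4, n)
outcome = (rng.uniform(size=n) < true_risk).astype(float)
pred_risk = np.clip(true_risk * 1.5, 0.0, 1.0)

err = calibration_error(pred_risk, outcome)
lo, hi = bootstrap_ci(pred_risk, outcome)
print(f"calibration error ~ {err:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

The interval conveys both the magnitude of miscalibration and the uncertainty of its estimate, which is what allows two candidate models to be compared on the same validation cohort.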