Multiaccuracy for Subpopulation Calibration Over Distribution Shift in Medical Prediction Models

Daniel Kapash; Noam Barda; Omer Reingold; Noa Dagan; Ran Balicer

Proceedings of the sixth Conference on Health, Inference, and Learning, PMLR 287:130-144, 2025.

Abstract

Multiaccuracy was previously demonstrated to improve subpopulation calibration in medical prediction models, ensuring fairness towards subpopulations. Medical prediction models often experience degraded performance due to distribution shifts (e.g., changes in input data resulting from changes in space or time). The effectiveness of multiaccuracy in ensuring medical predictors’ fairness under these circumstances has been suggested theoretically but has yet to be studied empirically. To explore this, we trained prediction models using real-world data, applied an adaptation of multiaccuracy as a post-processing step to intersecting subpopulations defined by combinations of protected features such as age, gender, and socioeconomic status, and tested the performance of the models on target test sets drawn from distributions different from those of the development cohorts. The results demonstrated that the improvement in subpopulation calibration achieved by multiaccuracy was maintained in the target distribution over two experiments, simulating spatial-temporal and migration-induced distribution shifts. On average, over the two experiments, the Calibration in the Large mean error and variance measures were reduced by 71.8% and 70.7%, respectively, on the target distributions after applying multiaccuracy. These findings highlight the potential of post-processing for multiaccuracy as a tool for enhancing the fairness and reliability of medical prediction models across diverse populations, even under circumstances of major distribution shifts.
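
The subpopulation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code and does not implement the multiaccuracy post-processing step itself. It assumes that calibration-in-the-large error for a subgroup is the absolute gap between the mean predicted risk and the observed event rate, and that the "mean error" and "variance" measures summarize that gap across intersecting subgroups; the abstract does not give the paper's exact definitions. The function names, field names, and toy records are hypothetical.

```python
from statistics import mean, pvariance


def citl_error(preds, labels):
    """Assumed definition: |mean predicted risk - observed event rate|."""
    return abs(mean(preds) - mean(labels))


def subgroup_citl(records, features):
    """Compute the assumed error for each observed combination of features."""
    groups = {}
    for rec in records:
        # One subgroup per combination of protected-feature values.
        key = tuple(rec[f] for f in features)
        groups.setdefault(key, []).append(rec)
    return {
        key: citl_error([r["pred"] for r in grp], [r["label"] for r in grp])
        for key, grp in groups.items()
    }


def summarize(errors):
    """Mean and (population) variance of the per-subgroup errors."""
    values = list(errors.values())
    return mean(values), pvariance(values)


# Hypothetical toy data: two intersecting subgroups (sex x age band).
records = [
    {"sex": "F", "age": "<65", "pred": 0.2, "label": 0},
    {"sex": "F", "age": "<65", "pred": 0.4, "label": 1},
    {"sex": "M", "age": "65+", "pred": 0.5, "label": 1},
    {"sex": "M", "age": "65+", "pred": 0.7, "label": 1},
]

errors = subgroup_citl(records, ["sex", "age"])
mean_error, error_variance = summarize(errors)
print(errors, mean_error, error_variance)
```

Running the same summary on predictions before and after a post-processing step, on a target cohort, would give the kind of relative reduction the abstract reports, under the assumed definitions above.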

Cite this Paper

BibTeX

@InProceedings{pmlr-v287-kapash25a,
  title = 	 {Multiaccuracy for Subpopulation Calibration Over Distribution Shift in Medical Prediction Models},
  author =       {Kapash, Daniel and Barda, Noam and Reingold, Omer and Dagan, Noa and Balicer, Ran},
  booktitle = 	 {Proceedings of the sixth Conference on Health, Inference, and Learning},
  pages = 	 {130--144},
  year = 	 {2025},
  editor = 	 {Xu, Xuhai Orson and Choi, Edward and Singhal, Pankhuri and Gerych, Walter and Tang, Shengpu and Agrawal, Monica and Subbaswamy, Adarsh and Sizikova, Elena and Dunn, Jessilyn and Daneshjou, Roxana and Sarker, Tasmie and McDermott, Matthew and Chen, Irene},
  volume = 	 {287},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--27 Jun},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v287/main/assets/kapash25a/kapash25a.pdf},
  url = 	 {https://proceedings.mlr.press/v287/kapash25a.html},
  abstract = 	 {Multiaccuracy was previously demonstrated to improve subpopulation calibration in medical prediction models, ensuring fairness towards subpopulations. Medical prediction models often experience degraded performance due to distribution shifts (e.g. changes in input data resulting from changes in space or time), but the effectiveness of multiaccuracy in ensuring medical predictors’ fairness under these circumstances was suggested theoretically but has yet to be studied empirically. To explore this, we trained prediction models using real-world data, applied an adaptation of multiaccuracy as a post-processing step to intersecting subpopulations defined by combinations of protected features such as age, gender, and socioeconomic status, and tested the performance of the models on target test sets from distributions different than the development cohorts. The results demonstrated that the improvement in subpopulation calibration achieved by multiaccuracy was maintained in the target distribution over two experiments, simulating spatial-temporal and migration-induced distribution shifts. On average, over the two experiments, Calibration in the Large mean error and variance measures were reduced by 71.8% and 70.7% on the target distributions after applying multiaccuracy, respectively. These findings highlight the potential of post-processing for multiaccuracy as a tool for enhancing the fairness and reliability of medical prediction models across diverse populations, even under circumstances of major distribution shifts.}
}

Endnote

%0 Conference Paper
%T Multiaccuracy for Subpopulation Calibration Over Distribution Shift in Medical Prediction Models
%A Daniel Kapash
%A Noam Barda
%A Omer Reingold
%A Noa Dagan
%A Ran Balicer
%B Proceedings of the sixth Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Xuhai Orson Xu
%E Edward Choi
%E Pankhuri Singhal
%E Walter Gerych
%E Shengpu Tang
%E Monica Agrawal
%E Adarsh Subbaswamy
%E Elena Sizikova
%E Jessilyn Dunn
%E Roxana Daneshjou
%E Tasmie Sarker
%E Matthew McDermott
%E Irene Chen
%F pmlr-v287-kapash25a
%I PMLR
%P 130--144
%U https://proceedings.mlr.press/v287/kapash25a.html
%V 287
%X Multiaccuracy was previously demonstrated to improve subpopulation calibration in medical prediction models, ensuring fairness towards subpopulations. Medical prediction models often experience degraded performance due to distribution shifts (e.g. changes in input data resulting from changes in space or time), but the effectiveness of multiaccuracy in ensuring medical predictors’ fairness under these circumstances was suggested theoretically but has yet to be studied empirically. To explore this, we trained prediction models using real-world data, applied an adaptation of multiaccuracy as a post-processing step to intersecting subpopulations defined by combinations of protected features such as age, gender, and socioeconomic status, and tested the performance of the models on target test sets from distributions different than the development cohorts. The results demonstrated that the improvement in subpopulation calibration achieved by multiaccuracy was maintained in the target distribution over two experiments, simulating spatial-temporal and migration-induced distribution shifts. On average, over the two experiments, Calibration in the Large mean error and variance measures were reduced by 71.8% and 70.7% on the target distributions after applying multiaccuracy, respectively. These findings highlight the potential of post-processing for multiaccuracy as a tool for enhancing the fairness and reliability of medical prediction models across diverse populations, even under circumstances of major distribution shifts.

APA

Kapash, D., Barda, N., Reingold, O., Dagan, N., & Balicer, R. (2025). Multiaccuracy for Subpopulation Calibration Over Distribution Shift in Medical Prediction Models. Proceedings of the sixth Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 287:130-144. Available from https://proceedings.mlr.press/v287/kapash25a.html.
