Revisiting Performance Claims for Chest X-Ray Models Using Clinical Context

Andrew Wang; Jiashou Zhang; Michael Oberst

Proceedings of the 7th Conference on Health, Inference, and Learning, PMLR 333:493-536, 2026.

Abstract

Public datasets of Chest X-Rays (CXRs) have long been a popular benchmark for developing machine learning (ML) computer vision models in healthcare. However, the strong average-case performance reported for these models does not necessarily reflect their actual utility in heterogeneous clinical settings, and may mask weaker performance in medically significant scenarios. In this work, we use clinical context to provide a more holistic evaluation of models for CXR diagnosis. In particular, we use discharge summaries, recorded prior to each CXR, to derive a “pre-CXR” probability of each CXR label, as a proxy for the contextual knowledge available to clinicians when interpreting CXRs. We use this measure to probe model performance along two dimensions. First, using a stratified analysis, we show that models tend to have lower performance (as measured by AUROC and other metrics) among individuals with higher pre-CXR probability. Second, by controlling for pre-CXR probability via matching and re-weighting, we demonstrate that performance degrades when the correlation between prior context and the current CXR label is broken, suggesting that model performance is highly sensitive to the underlying distribution of clinical context. Specifically, cases with high pre-test probabilities present a fundamentally more difficult visual classification task, highlighting a gap in clinical utility when models are applied to high-risk cohorts.
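The stratified analysis described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `auroc` and `stratified_auroc`, the bin edges, and the toy arrays are all assumptions made for illustration, and the numbers are synthetic rather than results from the paper. The sketch computes AUROC separately within bins of pre-CXR probability, assuming that probability lies in [0, 1].

```python
from bisect import bisect_right
from collections import defaultdict


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUROC is undefined unless both classes are present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_auroc(labels, scores, pre_cxr_prob, edges=(0.0, 0.5, 1.0)):
    """AUROC within each pre-CXR probability bin (hypothetical bin edges).

    Bin i covers [edges[i], edges[i+1]); the last bin also includes its right edge.
    """
    bins = defaultdict(lambda: ([], []))
    for y, s, p in zip(labels, scores, pre_cxr_prob):
        idx = min(max(bisect_right(edges, p) - 1, 0), len(edges) - 2)
        bins[idx][0].append(y)
        bins[idx][1].append(s)
    return {i: auroc(ys, ss) for i, (ys, ss) in sorted(bins.items())}


# Synthetic toy data (NOT from the paper): model scores separate the classes
# perfectly in the low pre-CXR bin and less well in the high pre-CXR bin.
labels = [0, 1, 0, 1, 0, 1, 0, 1]
scores = [0.1, 0.9, 0.2, 0.8, 0.6, 0.5, 0.4, 0.7]
pre_cxr_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

result = stratified_auroc(labels, scores, pre_cxr_prob)
print(result)
```

Reporting one AUROC per bin, rather than a single pooled value, is what makes a gap between low and high pre-CXR probability cohorts visible; a pooled AUROC can look strong even when one stratum performs poorly.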

Cite this Paper

BibTeX

@InProceedings{pmlr-v333-wang26c,
  title = 	 {Revisiting Performance Claims for Chest X-Ray Models Using Clinical Context},
  author =       {Wang, Andrew and Zhang, Jiashou and Oberst, Michael},
  booktitle = 	 {Proceedings of the 7th Conference on Health, Inference, and Learning},
  pages = 	 {493--536},
  year = 	 {2026},
  editor = 	 {Healey, Elizabeth and Fries, Jason and Pollard, Tom and Tang, Shengpu and Zink, Anna and Hartvigsen, Tom and Agrawal, Monica and Finlayson, Sam and Glicksberg, Benjamin and Beaulieu-Jones, Brett and Wang, Kai and Fontalvo, Daseyra and Sarker, Tasmie and Chen, Irene and Alsentzer, Emily},
  volume = 	 {333},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29--30 Jun},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v333/main/assets/wang26c/wang26c.pdf},
  url = 	 {https://proceedings.mlr.press/v333/wang26c.html},
  abstract = 	 {Public datasets of Chest X-Rays (CXRs) have long been a popular benchmark for developing machine learning (ML) computer vision models in healthcare. However, the strong average-case performance reported for these models does not necessarily reflect their actual utility in heterogeneous clinical settings, and may mask weaker performance in medically significant scenarios. In this work, we use clinical context to provide a more holistic evaluation of models for CXR diagnosis. In particular, we use discharge summaries, recorded prior to each CXR, to derive a “pre-CXR” probability of each CXR label, as a proxy for the contextual knowledge available to clinicians when interpreting CXRs. We use this measure to probe model performance along two dimensions. First, using a stratified analysis, we show that models tend to have lower performance (as measured by AUROC and other metrics) among individuals with higher pre-CXR probability. Second, by controlling for pre-CXR probability via matching and re-weighting, we demonstrate that performance degrades when the correlation between prior context and the current CXR label is broken, suggesting that model performance is highly sensitive to the underlying distribution of clinical context. Specifically, cases with high pre-test probabilities present a fundamentally more difficult visual classification task, highlighting a gap in clinical utility when models are applied to high-risk cohorts.}
}

EndNote

%0 Conference Paper
%T Revisiting Performance Claims for Chest X-Ray Models Using Clinical Context
%A Andrew Wang
%A Jiashou Zhang
%A Michael Oberst
%B Proceedings of the 7th Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Elizabeth Healey
%E Jason Fries
%E Tom Pollard
%E Shengpu Tang
%E Anna Zink
%E Tom Hartvigsen
%E Monica Agrawal
%E Sam Finlayson
%E Benjamin Glicksberg
%E Brett Beaulieu-Jones
%E Kai Wang
%E Daseyra Fontalvo
%E Tasmie Sarker
%E Irene Chen
%E Emily Alsentzer
%F pmlr-v333-wang26c
%I PMLR
%P 493--536
%U https://proceedings.mlr.press/v333/wang26c.html
%V 333
%X Public datasets of Chest X-Rays (CXRs) have long been a popular benchmark for developing machine learning (ML) computer vision models in healthcare. However, the strong average-case performance reported for these models does not necessarily reflect their actual utility in heterogeneous clinical settings, and may mask weaker performance in medically significant scenarios. In this work, we use clinical context to provide a more holistic evaluation of models for CXR diagnosis. In particular, we use discharge summaries, recorded prior to each CXR, to derive a “pre-CXR” probability of each CXR label, as a proxy for the contextual knowledge available to clinicians when interpreting CXRs. We use this measure to probe model performance along two dimensions. First, using a stratified analysis, we show that models tend to have lower performance (as measured by AUROC and other metrics) among individuals with higher pre-CXR probability. Second, by controlling for pre-CXR probability via matching and re-weighting, we demonstrate that performance degrades when the correlation between prior context and the current CXR label is broken, suggesting that model performance is highly sensitive to the underlying distribution of clinical context. Specifically, cases with high pre-test probabilities present a fundamentally more difficult visual classification task, highlighting a gap in clinical utility when models are applied to high-risk cohorts.

APA

Wang, A., Zhang, J. & Oberst, M. (2026). Revisiting Performance Claims for Chest X-Ray Models Using Clinical Context. Proceedings of the 7th Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 333:493-536. Available from https://proceedings.mlr.press/v333/wang26c.html.
