What do LLMs value? An evaluation framework for revealing subjective trade-offs in assessment of glycemic control
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:136-151, 2026.
Abstract
Clinical decisions often require balancing conflicting priorities rather than simply selecting a single “correct” answer. We present an evaluation framework that probes the value judgments embedded in large language models (LLMs) by testing how they assess the quality of glycemic control from continuous glucose monitoring (CGM) data. Using synthetic type 1 diabetes profiles, we asked five commercial LLMs to perform pairwise comparisons of CGM summary statistics and derived a percentile ranking for each profile. We then quantified alignment with two reference metrics: time in range (TIR) and the expert-derived Glycemia Risk Index (GRI), which was developed with clinician input on preferences across glycemic ranges. Across three insulin therapy modalities, newer models correlated more strongly with GRI than older models, suggesting a generational shift toward expert consensus. However, a perturbation analysis revealed instances of disagreement with the GRI over the relative weighting of mild hypoglycemia and mild hyperglycemia. These results demonstrate that high average agreement with clinical metrics can mask clinically meaningful misalignments in how LLMs prioritize risks. Our proposed framework reveals how LLM outputs reflect competing priorities in clinical contexts.
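The pipeline sketched in the abstract — pairwise comparisons aggregated into a percentile ranking, then rank-correlated against a reference metric — can be illustrated with a minimal sketch. All names, data, and the Borda-style aggregation below are illustrative assumptions, not the paper's actual implementation; the toy "judge" stands in for an LLM comparator and the toy reference scores stand in for GRI.

```python
# Illustrative sketch (not the paper's code): derive percentile ranks
# from pairwise comparisons, then measure rank agreement with a
# reference metric via Spearman correlation.
from itertools import combinations

def percentile_ranks(profiles, prefer):
    """Borda-style scoring: a profile's score is the number of pairwise
    comparisons it wins, rescaled to a 0-100 percentile."""
    wins = {p: 0 for p in profiles}
    for a, b in combinations(profiles, 2):
        wins[prefer(a, b)] += 1  # `prefer` plays the role of the LLM judge
    n = len(profiles)
    return {p: 100.0 * w / (n - 1) for p, w in wins.items()}

def spearman(xs, ys):
    """Spearman rank correlation (no ties, for brevity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy data: hypothetical reference scores per profile (lower = better
# control, standing in for GRI).
reference = {0: 10.0, 1: 35.0, 2: 20.0, 3: 60.0}
judge = lambda a, b: a if reference[a] < reference[b] else b  # perfectly aligned judge
pct = percentile_ranks(list(reference), judge)
ids = sorted(pct)
rho = spearman([-reference[p] for p in ids], [pct[p] for p in ids])
print(rho)  # → 1.0 for a judge that always agrees with the reference
```

A misaligned judge (e.g. one that under-weights mild hypoglycemia relative to the reference) would produce rho below 1.0, which is the kind of gap the perturbation analysis is designed to surface.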