What do LLMs value? An evaluation framework for revealing subjective trade-offs in assessment of glycemic control
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:136-151, 2026.
Abstract
Clinical decisions often require balancing conflicting priorities rather than simply selecting a single “correct” answer. We present an evaluation framework that probes the value judgments embedded in large language models (LLMs) by testing how they assess the quality of glycemic control from continuous glucose monitoring (CGM) data. Using synthetic type 1 diabetes profiles, we asked five commercial LLMs to perform pairwise comparisons of CGM summary statistics and derived a percentile ranking for each profile. We then quantified alignment with two reference metrics: time in range (TIR) and the expert-derived Glycemia Risk Index (GRI), which was developed with clinician input on preferences across glycemic ranges. Across three insulin therapy modalities, newer models correlated more strongly with GRI than older models, suggesting a generational shift toward expert consensus. However, a perturbation analysis revealed instances of disagreement with the GRI over the relative weighting of mild hypoglycemia and mild hyperglycemia. These results demonstrate that high average agreement with clinical metrics can mask clinically meaningful misalignments in how LLMs prioritize risks. Our proposed framework reveals how LLM outputs reflect competing priorities in clinical contexts.
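The pipeline sketched in the abstract — pairwise comparisons aggregated into a percentile ranking, then rank-correlated against a reference metric — can be illustrated with a minimal sketch. All names, data, and the Borda-style aggregation below are illustrative assumptions, not the paper's actual implementation; the toy "judge" stands in for an LLM comparator and the toy reference scores stand in for GRI.

```python
# Illustrative sketch (not the paper's code): derive percentile ranks
# from pairwise comparisons, then measure rank agreement with a
# reference metric via Spearman correlation.
from itertools import combinations

def percentile_ranks(profiles, prefer):
    """Borda-style scoring: a profile's score is the number of pairwise
    comparisons it wins, rescaled to a 0-100 percentile."""
    wins = {p: 0 for p in profiles}
    for a, b in combinations(profiles, 2):
        wins[prefer(a, b)] += 1  # `prefer` plays the role of the LLM judge
    n = len(profiles)
    return {p: 100.0 * w / (n - 1) for p, w in wins.items()}

def spearman(xs, ys):
    """Spearman rank correlation (no ties, for brevity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy data: hypothetical reference scores per profile (lower = better
# control, standing in for GRI).
reference = {0: 10.0, 1: 35.0, 2: 20.0, 3: 60.0}
judge = lambda a, b: a if reference[a] < reference[b] else b  # perfectly aligned judge
pct = percentile_ranks(list(reference), judge)
ids = sorted(pct)
rho = spearman([-reference[p] for p in ids], [pct[p] for p in ids])
print(rho)  # → 1.0 for a judge that always agrees with the reference
```

A misaligned judge (e.g. one that under-weights mild hypoglycemia relative to the reference) would produce rho below 1.0, which is the kind of gap the perturbation analysis is designed to surface.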