APRIL: Annotations for Policy evaluation with Reliable Inference from LLMs

Aishwarya Mandyam; Kalyani Limaye; Barbara E. Engelhardt; Emily Alsentzer

APRIL: Annotations for Policy evaluation with Reliable Inference from LLMs

Aishwarya Mandyam, Kalyani Limaye, Barbara E. Engelhardt, Emily Alsentzer

Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1-22, 2026.

Abstract

Off-policy evaluation (OPE) estimates the value of a contextual bandit policy prior to deployment. As such, OPE plays a critical role in ensuring safety in high-stakes domains such as healthcare. However, standard OPE approaches are limited by the size and coverage of the behavior dataset. While previous work has explored using expert-labeled counterfactual annotations to enhance dataset coverage, obtaining such annotations is expensive, limiting the scalability of prior approaches. We propose leveraging large language models (LLMs) to generate counterfactual annotations for OPE in medical domains. Our method uses domain knowledge to guide LLMs in predicting how key clinical features evolve under alternate treatments. These predicted features can then be transformed using known reward functions to create counterfactual annotations. We first evaluate the ability of several LLMs to predict clinical features across two patient subsets in MIMIC-IV, finding that state-of-the-art LLMs achieve comparable performance. Building on this capacity to predict clinical features, we generate LLM-based counterfactual annotations and incorporate them into an OPE estimator. Our empirical results analyze the benefits of counterfactual annotations under varying degrees of shift between the behavior and target policies. We find that in most cases, the LLM-based counterfactual annotations significantly improve OPE estimates up to a point. We provide an entropy-based metric to identify when additional annotations cease to be useful. Our results demonstrate that LLM-based counterfactual annotations offer a scalable approach for addressing coverage limitations in healthcare datasets, enabling safer deployment of decision-making policies in clinical settings.

Cite this Paper

BibTeX

@InProceedings{pmlr-v297-mandyam26a,
  title = 	 {{APRIL}: Annotations for Policy evaluation with Reliable Inference from {LLM}s},
  author =       {Mandyam, Aishwarya and Limaye, Kalyani and Engelhardt, Barbara E. and Alsentzer, Emily},
  booktitle = 	 {Proceedings of the Fifth Machine Learning for Health Symposium},
  pages = 	 {1--22},
  year = 	 {2026},
  editor = 	 {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush},
  volume = 	 {297},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--14 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v297/main/assets/mandyam26a/mandyam26a.pdf},
  url = 	 {https://proceedings.mlr.press/v297/mandyam26a.html},
  abstract = 	 {Off-policy evaluation (OPE) estimates the value of a contextual bandit policy prior to deployment. As such, OPE plays a critical role in ensuring safety in high-stakes domains such as healthcare. However, standard OPE approaches are limited by the size and coverage of the behavior dataset. While previous work has explored using expert-labeled counterfactual annotations to enhance dataset coverage, obtaining such annotations is expensive, limiting the scalability of prior approaches. We propose leveraging large language models (LLMs) to generate counterfactual annotations for OPE in medical domains. Our method uses domain knowledge to guide LLMs in predicting how key clinical features evolve under alternate treatments. These predicted features can then be transformed using known reward functions to create counterfactual annotations. We first evaluate the ability of several LLMs to predict clinical features across two patient subsets in MIMIC-IV, finding that state-of-the-art LLMs achieve comparable performance. Building on this capacity to predict clinical features, we generate LLM-based counterfactual annotations and incorporate them into an OPE estimator. Our empirical results analyze the benefits of counterfactual annotations under varying degrees of shift between the behavior and target policies. We find that in most cases, the LLM-based counterfactual annotations significantly improve OPE estimates up to a point. We provide an entropy-based metric to identify when additional annotations cease to be useful. Our results demonstrate that LLM-based counterfactual annotations offer a scalable approach for addressing coverage limitations in healthcare datasets, enabling safer deployment of decision-making policies in clinical settings.}
}

Endnote

%0 Conference Paper
%T APRIL: Annotations for Policy evaluation with Reliable Inference from LLMs
%A Aishwarya Mandyam
%A Kalyani Limaye
%A Barbara E. Engelhardt
%A Emily Alsentzer
%B Proceedings of the Fifth Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2026
%E Peniel Argaw
%E Haoran Zhang
%E Sarah Jabbour
%E Payal Chandak
%E Jerry Ji
%E Sumit Mukherjee
%E Olawale Salaudeen
%E Trenton Chang
%E Elizabeth Healey
%E Fabian Gröger
%E Amin Adibi
%E Stefan Hegselmann
%E Benjamin Wild
%E Ayush Noori	
%F pmlr-v297-mandyam26a
%I PMLR
%P 1--22
%U https://proceedings.mlr.press/v297/mandyam26a.html
%V 297
%X Off-policy evaluation (OPE) estimates the value of a contextual bandit policy prior to deployment. As such, OPE plays a critical role in ensuring safety in high-stakes domains such as healthcare. However, standard OPE approaches are limited by the size and coverage of the behavior dataset. While previous work has explored using expert-labeled counterfactual annotations to enhance dataset coverage, obtaining such annotations is expensive, limiting the scalability of prior approaches. We propose leveraging large language models (LLMs) to generate counterfactual annotations for OPE in medical domains. Our method uses domain knowledge to guide LLMs in predicting how key clinical features evolve under alternate treatments. These predicted features can then be transformed using known reward functions to create counterfactual annotations. We first evaluate the ability of several LLMs to predict clinical features across two patient subsets in MIMIC-IV, finding that state-of-the-art LLMs achieve comparable performance. Building on this capacity to predict clinical features, we generate LLM-based counterfactual annotations and incorporate them into an OPE estimator. Our empirical results analyze the benefits of counterfactual annotations under varying degrees of shift between the behavior and target policies. We find that in most cases, the LLM-based counterfactual annotations significantly improve OPE estimates up to a point. We provide an entropy-based metric to identify when additional annotations cease to be useful. Our results demonstrate that LLM-based counterfactual annotations offer a scalable approach for addressing coverage limitations in healthcare datasets, enabling safer deployment of decision-making policies in clinical settings.

APA

Mandyam, A., Limaye, K., Engelhardt, B.E. & Alsentzer, E.. (2026). APRIL: Annotations for Policy evaluation with Reliable Inference from LLMs. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:1-22 Available from https://proceedings.mlr.press/v297/mandyam26a.html.

APRIL: Annotations for Policy evaluation with Reliable Inference from LLMs

Abstract

Cite this Paper

Related Material