CANDOR: Counterfactual ANnotated DOubly Robust Off-Policy Evaluation

Aishwarya Mandyam, Shengpu Tang, Jiayu Yao, Jenna Wiens, Barbara E Engelhardt
Proceedings of the 7th Conference on Health, Inference, and Learning, PMLR 333:644-687, 2026.

Abstract

Off-policy evaluation (OPE) is critical for applying contextual bandit algorithms to high-stakes decision-making settings such as healthcare, where new treatment policies must be evaluated prior to deployment. Unfortunately, OPE techniques are inherently limited by the breadth of the available data, which may not be sufficient to evaluate the performance of a new policy. Recent work attempts to improve dataset coverage by adding expert-annotated counterfactual samples. However, such annotations are often imperfect and can lead to worse estimator performance than using no annotations at all. To better leverage imperfect annotations, we propose a family of OPE estimators grounded in the doubly robust (DR) framework, which combines importance sampling (IS) with a reward model (direct method, DM) for better statistical guarantees. We study three ways of incorporating counterfactual annotations. Under mild assumptions, we prove that using annotations within just the DM component yields the most desirable theoretical results. Experiments on multiple healthcare tasks, including real-world electronic health records (EHR) data, show that this strategy is most robust under misspecified reward models and inaccurate annotations. By addressing the challenges posed by imperfect annotations, this work broadens the applicability of OPE methods and facilitates safer deployment of decision-making policies in healthcare.
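
As a point of reference for the DR framework mentioned in the abstract, the following is a minimal sketch of the standard doubly robust estimator for contextual bandits, which adds an importance-weighted correction of the logged reward to a reward-model (DM) estimate. It is not the CANDOR estimators themselves, and it does not use counterfactual annotations. The function name `doubly_robust_value`, the callables `pi_e`, `pi_b`, and `r_hat`, and the toy data are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence, Tuple

# One logged interaction: (context, action taken, observed reward).
Sample = Tuple[int, int, float]


def doubly_robust_value(
    data: Sequence[Sample],
    n_actions: int,
    pi_e: Callable[[int, int], float],   # target policy probability pi_e(a | x)
    pi_b: Callable[[int, int], float],   # behavior policy probability pi_b(a | x)
    r_hat: Callable[[int, int], float],  # reward model estimate r_hat(x, a)
) -> float:
    """Standard DR estimate of the target policy's value (no annotations)."""
    total = 0.0
    for x, a, r in data:
        # Direct-method term: model-based expected reward under pi_e.
        dm = sum(pi_e(x, b) * r_hat(x, b) for b in range(n_actions))
        # Importance-weighted correction applied to the logged action's residual.
        rho = pi_e(x, a) / pi_b(x, a)
        total += dm + rho * (r - r_hat(x, a))
    return total / len(data)


# Toy setup (assumed): one context, two actions, action 1 always pays 1.0.
data = [(0, 0, 0.0), (0, 1, 1.0)]
uniform = lambda x, a: 0.5                       # behavior policy
always_one = lambda x, a: 1.0 if a == 1 else 0.0  # target policy

# A reward model that is wrong everywhere (predicts 0): the IS term corrects it.
value_bad_model = doubly_robust_value(data, 2, always_one, uniform, lambda x, a: 0.0)
# A reward model that is exactly right: the residual correction vanishes.
value_good_model = doubly_robust_value(data, 2, always_one, uniform, lambda x, a: float(a))
```

In this toy case both calls evaluate to 1.0, the true value of the target policy. This illustrates the "doubly robust" property when the behavior policy is known: the estimate can remain accurate even when the reward model is misspecified.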

Cite this Paper


BibTeX
@InProceedings{pmlr-v333-mandyam26a,
  title = {CANDOR: Counterfactual ANnotated DOubly Robust Off-Policy Evaluation},
  author = {Mandyam, Aishwarya and Tang, Shengpu and Yao, Jiayu and Wiens, Jenna and Engelhardt, Barbara E},
  booktitle = {Proceedings of the 7th Conference on Health, Inference, and Learning},
  pages = {644--687},
  year = {2026},
  editor = {Healey, Elizabeth and Fries, Jason and Pollard, Tom and Tang, Shengpu and Zink, Anna and Hartvigsen, Tom and Agrawal, Monica and Finlayson, Sam and Glicksberg, Benjamin and Beaulieu-Jones, Brett and Wang, Kai and Fontalvo, Daseyra and Sarker, Tasmie and Chen, Irene and Alsentzer, Emily},
  volume = {333},
  series = {Proceedings of Machine Learning Research},
  month = {29--30 Jun},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v333/main/assets/mandyam26a/mandyam26a.pdf},
  url = {https://proceedings.mlr.press/v333/mandyam26a.html},
  abstract = {Off-policy evaluation (OPE) is critical for applying contextual bandit algorithms to high-stakes decision-making settings such as healthcare, where new treatment policies must be evaluated prior to deployment. Unfortunately, OPE techniques are inherently limited by the breadth of the available data, which may not be sufficient to evaluate the performance of a new policy. Recent work attempts to improve dataset coverage by adding expert-annotated counterfactual samples. However, such annotations are often imperfect and can lead to worse estimator performance than using no annotations at all. To better leverage imperfect annotations, we propose a family of OPE estimators grounded in the doubly robust (DR) framework, which combines importance sampling (IS) with a reward model (direct method, DM) for better statistical guarantees. We study three ways of incorporating counterfactual annotations. Under mild assumptions, we prove that using annotations within just the DM component yields the most desirable theoretical results. Experiments on multiple healthcare tasks, including real-world electronic health records (EHR) data, show that this strategy is most robust under misspecified reward models and inaccurate annotations. By addressing the challenges posed by imperfect annotations, this work broadens the applicability of OPE methods and facilitates safer deployment of decision-making policies in healthcare.}
}
Endnote
%0 Conference Paper
%T CANDOR: Counterfactual ANnotated DOubly Robust Off-Policy Evaluation
%A Aishwarya Mandyam
%A Shengpu Tang
%A Jiayu Yao
%A Jenna Wiens
%A Barbara E Engelhardt
%B Proceedings of the 7th Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Elizabeth Healey
%E Jason Fries
%E Tom Pollard
%E Shengpu Tang
%E Anna Zink
%E Tom Hartvigsen
%E Monica Agrawal
%E Sam Finlayson
%E Benjamin Glicksberg
%E Brett Beaulieu-Jones
%E Kai Wang
%E Daseyra Fontalvo
%E Tasmie Sarker
%E Irene Chen
%E Emily Alsentzer
%F pmlr-v333-mandyam26a
%I PMLR
%P 644--687
%U https://proceedings.mlr.press/v333/mandyam26a.html
%V 333
%X Off-policy evaluation (OPE) is critical for applying contextual bandit algorithms to high-stakes decision-making settings such as healthcare, where new treatment policies must be evaluated prior to deployment. Unfortunately, OPE techniques are inherently limited by the breadth of the available data, which may not be sufficient to evaluate the performance of a new policy. Recent work attempts to improve dataset coverage by adding expert-annotated counterfactual samples. However, such annotations are often imperfect and can lead to worse estimator performance than using no annotations at all. To better leverage imperfect annotations, we propose a family of OPE estimators grounded in the doubly robust (DR) framework, which combines importance sampling (IS) with a reward model (direct method, DM) for better statistical guarantees. We study three ways of incorporating counterfactual annotations. Under mild assumptions, we prove that using annotations within just the DM component yields the most desirable theoretical results. Experiments on multiple healthcare tasks, including real-world electronic health records (EHR) data, show that this strategy is most robust under misspecified reward models and inaccurate annotations. By addressing the challenges posed by imperfect annotations, this work broadens the applicability of OPE methods and facilitates safer deployment of decision-making policies in healthcare.
APA
Mandyam, A., Tang, S., Yao, J., Wiens, J. & Engelhardt, B.E. (2026). CANDOR: Counterfactual ANnotated DOubly Robust Off-Policy Evaluation. Proceedings of the 7th Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 333:644-687. Available from https://proceedings.mlr.press/v333/mandyam26a.html.

Related Material