Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment

Huy Nghiem, Swetasudha Panda, Devashish Khatwani, Huy V. Nguyen, Krishnaram Kenthapadi, Hal Daumé III
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:661-696, 2026.

Abstract

Large Language Models (LLMs) are increasingly used in healthcare, yet ensuring their safety and trustworthiness remains a barrier to deployment. Conversational medical assistants must avoid unsafe compliance without over-refusing benign queries. We present an iterative post-deployment alignment framework that applies Kahneman–Tversky Optimization (KTO) and Direct Preference Optimization (DPO) to refine models against domain-specific safety signals. Using the CARES-18K benchmark for adversarial robustness, we evaluate four LLMs (Llama-3B/8B, Meditron-8B, Mistral-7B) across multiple cycles. Our results show up to 42% improvement in safety-related metrics for harmful query detection, alongside interesting trade-offs against erroneous refusals, thereby exposing architecture-dependent calibration biases. We also perform ablation studies to identify when self-evaluation is reliable and when external or finetuned judges are necessary to maximize performance gains. Our findings underscore the importance of adopting best practices that balance patient safety, user trust, and clinical utility in the design of conversational medical assistants.
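As background for the DPO objective the abstract references: a minimal sketch of the standard per-example DPO loss (Rafailov et al., 2023) on scalar log-probabilities. This is illustrative only and is not the authors' implementation; variable names and the default `beta` are assumptions.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the summed token log-probability of a response under
    the policy or the frozen reference model.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # Numerically plain -log(sigmoid(logits)); fine for a scalar sketch.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy has not moved from the reference, both margins are zero,
# so the loss is -log(0.5) = log 2.
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))  # → 0.6931471805599453
```

Minimizing this loss pushes the policy's log-probability up on the preferred (safe) response and down on the rejected one, relative to the reference model, which is how the preference signal in the paper's alignment cycles would enter training.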

Cite this Paper


BibTeX
@InProceedings{pmlr-v297-nghiem26a,
  title     = {Balancing Safety and Helpfulness in Healthcare {AI} Assistants through Iterative Preference Alignment},
  author    = {Nghiem, Huy and Panda, Swetasudha and Khatwani, Devashish and Nguyen, Huy V. and Kenthapadi, Krishnaram and Daum{\'e} III, Hal},
  booktitle = {Proceedings of the Fifth Machine Learning for Health Symposium},
  pages     = {661--696},
  year      = {2026},
  editor    = {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush},
  volume    = {297},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v297/main/assets/nghiem26a/nghiem26a.pdf},
  url       = {https://proceedings.mlr.press/v297/nghiem26a.html},
  abstract  = {Large Language Models ({LLM}s) are increasingly used in healthcare, yet ensuring their safety and trustworthiness remains a barrier to deployment. Conversational medical assistants must avoid unsafe compliance without over-refusing benign queries. We present an iterative post-deployment alignment framework that applies Kahneman–Tversky Optimization ({KTO}) and Direct Preference Optimization ({DPO}) to refine models against domain-specific safety signals. Using the {CARES}-18K benchmark for adversarial robustness, we evaluate four {LLM}s (Llama-3B/8B, Meditron-8B, Mistral-7B) across multiple cycles. Our results show up to 42% improvement in safety-related metrics for harmful query detection, alongside interesting trade-offs against erroneous refusals, thereby exposing architecture-dependent calibration biases. We also perform ablation studies to identify when self-evaluation is reliable and when external or finetuned judges are necessary to maximize performance gains. Our findings underscore the importance of adopting best practices that balance patient safety, user trust, and clinical utility in the design of conversational medical assistants.}
}
Endnote
%0 Conference Paper
%T Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment
%A Huy Nghiem
%A Swetasudha Panda
%A Devashish Khatwani
%A Huy V. Nguyen
%A Krishnaram Kenthapadi
%A Hal Daumé III
%B Proceedings of the Fifth Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2026
%E Peniel Argaw
%E Haoran Zhang
%E Sarah Jabbour
%E Payal Chandak
%E Jerry Ji
%E Sumit Mukherjee
%E Olawale Salaudeen
%E Trenton Chang
%E Elizabeth Healey
%E Fabian Gröger
%E Amin Adibi
%E Stefan Hegselmann
%E Benjamin Wild
%E Ayush Noori
%F pmlr-v297-nghiem26a
%I PMLR
%P 661--696
%U https://proceedings.mlr.press/v297/nghiem26a.html
%V 297
%X Large Language Models (LLMs) are increasingly used in healthcare, yet ensuring their safety and trustworthiness remains a barrier to deployment. Conversational medical assistants must avoid unsafe compliance without over-refusing benign queries. We present an iterative post-deployment alignment framework that applies Kahneman–Tversky Optimization (KTO) and Direct Preference Optimization (DPO) to refine models against domain-specific safety signals. Using the CARES-18K benchmark for adversarial robustness, we evaluate four LLMs (Llama-3B/8B, Meditron-8B, Mistral-7B) across multiple cycles. Our results show up to 42% improvement in safety-related metrics for harmful query detection, alongside interesting trade-offs against erroneous refusals, thereby exposing architecture-dependent calibration biases. We also perform ablation studies to identify when self-evaluation is reliable and when external or finetuned judges are necessary to maximize performance gains. Our findings underscore the importance of adopting best practices that balance patient safety, user trust, and clinical utility in the design of conversational medical assistants.
APA
Nghiem, H., Panda, S., Khatwani, D., Nguyen, H.V., Kenthapadi, K. & Daumé III, H. (2026). Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:661-696. Available from https://proceedings.mlr.press/v297/nghiem26a.html.