Clinical-R1: Empowering Large Language Models for Faithful and Comprehensive Reasoning with Clinical Objective Relative Policy Optimization

Boyang Gu, Hongjian Zhou, Bradley Max Segal, Jinge Wu, Zeyu Cao, Hantao Zhong, Lei Clifton, Fenglin Liu, David A. Clifton
Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare, PMLR 317:117-126, 2026.

Abstract

Recent advances in large language models (LLMs) have shown strong reasoning capabilities through large-scale pre-training and post-training reinforcement learning, as demonstrated by DeepSeek-R1. However, current post-training methods, such as Grouped Relative Policy Optimization (GRPO), mainly reward correctness, which is not aligned with the multi-dimensional objectives required in high-stakes fields such as medicine, where reasoning must also be faithful and comprehensive. We introduce Clinical-Objective Relative Policy Optimization (CRPO), a scalable, multi-objective, verifiable reinforcement learning method designed to align LLM post-training with clinical reasoning principles. CRPO integrates rule-based and verifiable reward signals that jointly optimize accuracy, faithfulness, and comprehensiveness without relying on human annotation. To demonstrate its effectiveness, we train Clinical-R1-3B, a 3B-parameter model for clinical reasoning. Experiments on three benchmarks demonstrate that CRPO substantially improves reasoning truthfulness and completeness over standard GRPO while also improving accuracy. This framework provides a scalable pathway to align LLM reasoning with clinical objectives, enabling safer and more collaborative AI systems for healthcare, and highlights the potential of multi-objective, verifiable RL methods in post-training scaling of LLMs for medical domains. Our data, models, and code are publicly available at https://github.com/BoyangGu1/Clinical-R1-3B.
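The abstract describes combining rule-based, verifiable rewards for accuracy, faithfulness, and comprehensiveness under a GRPO-style group-relative scheme. The sketch below illustrates the general idea only: the reward names, weights, and aggregation are hypothetical, not taken from the paper's actual implementation.

```python
# Hypothetical sketch of multi-objective reward shaping in the spirit of CRPO:
# combine per-objective verifiable rewards into a scalar, then compute a
# GRPO-style group-relative advantage over sampled completions.
# All names and weights here are illustrative assumptions.
from statistics import mean, pstdev


def combined_reward(accuracy: float, faithfulness: float,
                    comprehensiveness: float,
                    weights: tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of per-objective rewards, each assumed to lie in [0, 1]."""
    wa, wf, wc = weights
    return wa * accuracy + wf * faithfulness + wc * comprehensiveness


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each completion's reward by the
    mean and standard deviation of its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:  # all completions scored identically: no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


# Example: four sampled completions for one clinical question, each scored
# by (hypothetical) rule-based verifiers on the three objectives.
group = [
    combined_reward(1.0, 0.8, 0.5),  # correct but partially incomplete
    combined_reward(0.0, 0.9, 0.7),  # faithful reasoning, wrong answer
    combined_reward(1.0, 0.6, 0.9),  # correct and comprehensive
    combined_reward(0.0, 0.4, 0.2),  # weak on all objectives
]
advs = group_relative_advantages(group)
```

Under this aggregation, the completion that scores well across all three objectives receives the largest positive advantage, so the policy gradient favors responses that are jointly accurate, faithful, and comprehensive rather than merely correct.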

Cite this Paper


BibTeX
@InProceedings{pmlr-v317-gu26a,
  title     = {Clinical-R1: Empowering Large Language Models for Faithful and Comprehensive Reasoning with Clinical Objective Relative Policy Optimization},
  author    = {Gu, Boyang and Zhou, Hongjian and Segal, Bradley Max and Wu, Jinge and Cao, Zeyu and Zhong, Hantao and Clifton, Lei and Liu, Fenglin and Clifton, David A.},
  booktitle = {Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare},
  pages     = {117--126},
  year      = {2026},
  editor    = {Wu, Junde and Pan, Jiazhen and Zhu, Jiayuan and Luo, Luyang and Li, Yitong and Xu, Min and Jin, Yueming and Rueckert, Daniel},
  volume    = {317},
  series    = {Proceedings of Machine Learning Research},
  month     = {20--21 Jan},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v317/main/assets/gu26a/gu26a.pdf},
  url       = {https://proceedings.mlr.press/v317/gu26a.html},
  abstract  = {Recent advances in large language models (LLMs) have shown strong reasoning capabilities through large-scale pre-training and post-training reinforcement learning, demonstrated by DeepSeek-R1. However, current post-training methods, such as Grouped Relative Policy Optimization (GRPO), mainly reward correctness, which is not aligned with the multi-dimensional objectives required in high-stakes fields such as medicine, where reasoning must also be faithful and comprehensive. We introduce Clinical-Objective Relative Policy Optimization (CRPO), a scalable, multi-objective, verifiable reinforcement learning method designed to align LLM post-training with clinical reasoning principles. CRPO integrates rule-based and verifiable reward signals that jointly optimize accuracy, faithfulness, and comprehensiveness without relying on human annotation. To demonstrate its effectiveness, we train Clinical-R1-3B, a 3B-parameter model for clinical reasoning. The experiments on three benchmarks demonstrate that our CRPO substantially improves reasoning on truthfulness and completeness over standard GRPO while maintaining comfortable accuracy enhancements. This framework provides a scalable pathway to align LLM reasoning with clinical objectives, enabling safer and more collaborative AI systems for healthcare while also highlighting the potential of multi-objective, verifiable RL methods in post-training scaling of LLMs for medical domains. Our data, models, and code are all publicly available on https://github.com/BoyangGu1/Clinical-R1-3B.}
}
Endnote
%0 Conference Paper
%T Clinical-R1: Empowering Large Language Models for Faithful and Comprehensive Reasoning with Clinical Objective Relative Policy Optimization
%A Boyang Gu
%A Hongjian Zhou
%A Bradley Max Segal
%A Jinge Wu
%A Zeyu Cao
%A Hantao Zhong
%A Lei Clifton
%A Fenglin Liu
%A David A. Clifton
%B Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare
%C Proceedings of Machine Learning Research
%D 2026
%E Junde Wu
%E Jiazhen Pan
%E Jiayuan Zhu
%E Luyang Luo
%E Yitong Li
%E Min Xu
%E Yueming Jin
%E Daniel Rueckert
%F pmlr-v317-gu26a
%I PMLR
%P 117--126
%U https://proceedings.mlr.press/v317/gu26a.html
%V 317
%X Recent advances in large language models (LLMs) have shown strong reasoning capabilities through large-scale pre-training and post-training reinforcement learning, demonstrated by DeepSeek-R1. However, current post-training methods, such as Grouped Relative Policy Optimization (GRPO), mainly reward correctness, which is not aligned with the multi-dimensional objectives required in high-stakes fields such as medicine, where reasoning must also be faithful and comprehensive. We introduce Clinical-Objective Relative Policy Optimization (CRPO), a scalable, multi-objective, verifiable reinforcement learning method designed to align LLM post-training with clinical reasoning principles. CRPO integrates rule-based and verifiable reward signals that jointly optimize accuracy, faithfulness, and comprehensiveness without relying on human annotation. To demonstrate its effectiveness, we train Clinical-R1-3B, a 3B-parameter model for clinical reasoning. The experiments on three benchmarks demonstrate that our CRPO substantially improves reasoning on truthfulness and completeness over standard GRPO while maintaining comfortable accuracy enhancements. This framework provides a scalable pathway to align LLM reasoning with clinical objectives, enabling safer and more collaborative AI systems for healthcare while also highlighting the potential of multi-objective, verifiable RL methods in post-training scaling of LLMs for medical domains. Our data, models, and code are all publicly available on https://github.com/BoyangGu1/Clinical-R1-3B.
APA
Gu, B., Zhou, H., Segal, B.M., Wu, J., Cao, Z., Zhong, H., Clifton, L., Liu, F., & Clifton, D.A. (2026). Clinical-R1: Empowering Large Language Models for Faithful and Comprehensive Reasoning with Clinical Objective Relative Policy Optimization. Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare, in Proceedings of Machine Learning Research 317:117-126. Available from https://proceedings.mlr.press/v317/gu26a.html.