Decision-Point Guided Safe Policy Improvement

Abhishek Sharma, Leo Benac, Sonali Parbhoo, Finale Doshi-Velez
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:2935-2943, 2025.

Abstract

Within batch reinforcement learning, safe policy improvement seeks to ensure that the learned policy performs at least as well as the behavior policy that generated the dataset. The core challenge is seeking improvements while balancing risk when many state-action pairs may be infrequently visited. In this work, we introduce Decision Points RL (DPRL), an algorithm that restricts the set of state-action pairs (or regions for continuous states) considered for improvement. DPRL ensures high-confidence improvement in densely visited states (called ‘decision points’) while still utilizing data from sparsely visited states by using them for trajectory-based value estimates. By selectively limiting the state-actions where the policy deviates from the behavior, we achieve tighter theoretical guarantees that depend only on the counts of frequently observed state-action pairs rather than on state-action space size. Our empirical results confirm DPRL provides both safety and performance improvements across synthetic and real-world applications.
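
For intuition, the sketch below illustrates the decision-point idea in a tabular setting: deviate from the behavior policy only at state-action pairs visited often enough and whose trajectory-based (Monte-Carlo) return estimate beats the behavior policy's value at that state by a margin; everywhere else, keep the behavior policy. This is an illustrative reading of the abstract, not the paper's implementation; the function names, the count threshold `n_min`, and the fixed `margin` are assumptions.

```python
from collections import defaultdict

def decision_point_policy(trajectories, behavior_policy, n_min=20, margin=0.0, gamma=0.99):
    """Illustrative sketch of the decision-point idea (not the paper's code).

    Deviate from `behavior_policy` only at (s, a) pairs seen at least `n_min`
    times whose Monte-Carlo return estimate beats the behavior value at s by
    `margin`; at all other states the behavior policy is returned unchanged.
    """
    sa_counts = defaultdict(int)     # visits to each (state, action) pair
    sa_returns = defaultdict(float)  # sum of discounted returns from (state, action)
    s_counts = defaultdict(int)      # visits to each state under the behavior policy
    s_returns = defaultdict(float)   # sum of discounted returns from each state

    for traj in trajectories:        # traj is a list of (state, action, reward) tuples
        G = 0.0
        for (s, a, r) in reversed(traj):
            G = r + gamma * G        # trajectory-based (Monte-Carlo) return
            sa_counts[(s, a)] += 1
            sa_returns[(s, a)] += G
            s_counts[s] += 1
            s_returns[s] += G

    def policy(s):
        # Candidate deviations: only densely visited actions at this state.
        candidates = [(a, sa_returns[(s2, a)] / sa_counts[(s2, a)])
                      for (s2, a) in sa_counts
                      if s2 == s and sa_counts[(s2, a)] >= n_min]
        if not candidates:
            return behavior_policy(s)            # sparse state: stay with behavior
        v_behavior = s_returns[s] / s_counts[s]  # behavior value estimate at s
        best_a, best_q = max(candidates, key=lambda x: x[1])
        # Deviate only where the estimated advantage clears the margin.
        return best_a if best_q - v_behavior > margin else behavior_policy(s)

    return policy
```

In DPRL proper, the deviation set is chosen with high-confidence bounds rather than a fixed margin, and continuous states are handled via regions rather than exact state matches; the sketch only conveys the count-gated structure described in the abstract.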

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-sharma25a,
  title     = {Decision-Point Guided Safe Policy Improvement},
  author    = {Sharma, Abhishek and Benac, Leo and Parbhoo, Sonali and Doshi-Velez, Finale},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages     = {2935--2943},
  year      = {2025},
  editor    = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume    = {258},
  series    = {Proceedings of Machine Learning Research},
  month     = {03--05 May},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/sharma25a/sharma25a.pdf},
  url       = {https://proceedings.mlr.press/v258/sharma25a.html},
  abstract  = {Within batch reinforcement learning, safe policy improvement seeks to ensure that the learned policy performs at least as well as the behavior policy that generated the dataset. The core challenge is seeking improvements while balancing risk when many state-action pairs may be infrequently visited. In this work, we introduce Decision Points RL (DPRL), an algorithm that restricts the set of state-action pairs (or regions for continuous states) considered for improvement. DPRL ensures high-confidence improvement in densely visited states (called ‘decision points’) while still utilizing data from sparsely visited states by using them for trajectory-based value estimates. By selectively limiting the state-actions where the policy deviates from the behavior, we achieve tighter theoretical guarantees that depend only on the counts of frequently observed state-action pairs rather than on state-action space size. Our empirical results confirm DPRL provides both safety and performance improvements across synthetic and real-world applications.}
}
Endnote
%0 Conference Paper
%T Decision-Point Guided Safe Policy Improvement
%A Abhishek Sharma
%A Leo Benac
%A Sonali Parbhoo
%A Finale Doshi-Velez
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-sharma25a
%I PMLR
%P 2935--2943
%U https://proceedings.mlr.press/v258/sharma25a.html
%V 258
%X Within batch reinforcement learning, safe policy improvement seeks to ensure that the learned policy performs at least as well as the behavior policy that generated the dataset. The core challenge is seeking improvements while balancing risk when many state-action pairs may be infrequently visited. In this work, we introduce Decision Points RL (DPRL), an algorithm that restricts the set of state-action pairs (or regions for continuous states) considered for improvement. DPRL ensures high-confidence improvement in densely visited states (called ‘decision points’) while still utilizing data from sparsely visited states by using them for trajectory-based value estimates. By selectively limiting the state-actions where the policy deviates from the behavior, we achieve tighter theoretical guarantees that depend only on the counts of frequently observed state-action pairs rather than on state-action space size. Our empirical results confirm DPRL provides both safety and performance improvements across synthetic and real-world applications.
APA
Sharma, A., Benac, L., Parbhoo, S. & Doshi-Velez, F. (2025). Decision-Point Guided Safe Policy Improvement. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:2935-2943. Available from https://proceedings.mlr.press/v258/sharma25a.html.