Safety-Polarized and Prioritized Reinforcement Learning
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:15862-15886, 2025.
Abstract
Motivated by the fact that safety is the first priority in many real-world applications, we propose MaxSafe, a chance-constrained bi-level optimization framework for safe reinforcement learning. MaxSafe first minimizes the probability of unsafe events and then maximizes the return among the safest policies. We provide a tailored Q-learning algorithm for the MaxSafe objective, featuring a novel learning process for optimal action masks with theoretical convergence guarantees. To scale our algorithm to large experiments, we introduce two key techniques: safety polarization and safety-prioritized experience replay. Safety polarization generalizes optimal action masking by polarizing the Q-function: unsafe state-action pairs are assigned low values, effectively discouraging their selection. In parallel, safety-prioritized experience replay improves the learning of optimal action masks by prioritizing samples according to the temporal-difference (TD) errors of our proposed state-action reachability estimation functions, which efficiently addresses the challenge posed by sparse cost signals. Experiments on diverse autonomous driving and safe-control tasks show that our methods achieve near-maximal safety and an optimal reward-safety trade-off.
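Read as a bi-level program, the MaxSafe objective described above can be summarized roughly as follows (our paraphrase, not the paper's exact notation):

\[
\Pi_{\mathrm{safe}} = \arg\min_{\pi}\ \Pr\!\left(\text{unsafe} \mid \pi\right),
\qquad
\pi^{\star} \in \arg\max_{\pi \in \Pi_{\mathrm{safe}}} \mathbb{E}_{\pi}\!\left[\textstyle\sum_{t} \gamma^{t} r_{t}\right].
\]

The two large-scale techniques also admit a rough tabular sketch. The code below is illustrative only and is not the authors' implementation: the names (Q, R_hat, POLAR_VALUE), the particular reachability target, and the threshold-based polarization rule are all assumptions made for exposition.

```python
# Illustrative sketch only -- NOT the paper's implementation. Assumes a small
# tabular setting with a binary cost signal c (1 on unsafe transitions, else 0).
import numpy as np

n_states, n_actions = 16, 4
gamma = 0.99
POLAR_VALUE = -1e3          # assumed large negative value for (estimated) unsafe pairs

Q = np.zeros((n_states, n_actions))       # return-oriented Q-function
R_hat = np.zeros((n_states, n_actions))   # state-action unsafe-reachability estimate in [0, 1]
replay = []                               # entries: (s, a, r, c, s2, done)

def polarized_q(s, threshold=0.5):
    """Safety polarization (sketch): push Q-values of pairs whose estimated
    unsafe-reachability exceeds a threshold down to POLAR_VALUE so that a
    greedy policy avoids them."""
    q = Q[s].copy()
    q[R_hat[s] > threshold] = POLAR_VALUE
    return q

def reachability_td_error(s, a, c, s2, done):
    """TD error of the reachability estimate: the target propagates the minimal
    reachable unsafety of the next state (one simple choice, not necessarily
    the one used in the paper)."""
    target = 1.0 if c > 0 else (0.0 if done else gamma * R_hat[s2].min())
    return target - R_hat[s, a]

def sample_prioritized(batch_size=32):
    """Safety-prioritized replay (sketch): sample transitions with probability
    proportional to the magnitude of their reachability TD error."""
    if not replay:
        return []
    prios = np.array([abs(reachability_td_error(s, a, c, s2, d)) + 1e-3
                      for (s, a, r, c, s2, d) in replay])
    probs = prios / prios.sum()
    idx = np.random.choice(len(replay), size=min(batch_size, len(replay)), p=probs)
    return [replay[i] for i in idx]

def update(batch, alpha=0.1):
    """Tabular updates for both the reachability estimate and the Q-function,
    bootstrapping the latter from the polarized Q-values of the next state."""
    for (s, a, r, c, s2, done) in batch:
        R_hat[s, a] += alpha * reachability_td_error(s, a, c, s2, done)
        q_next = 0.0 if done else polarized_q(s2).max()
        Q[s, a] += alpha * (r + gamma * q_next - Q[s, a])
```

In this reading, the action mask of the tabular algorithm becomes a property of the polarized Q-function itself, while the replay priority concentrates updates on transitions where the reachability estimate is most wrong, which is where sparse cost signals carry the most information.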