Learn With Imagination: Safe Set Guided State-wise Constrained Policy Optimization

Yifan Sun, Feihan Li, Weiye Zhao, Rui Chen, Tianhao Wei, Changliu Liu
Proceedings of the 7th Annual Learning for Dynamics & Control Conference, PMLR 283:298-309, 2025.

Abstract

Deep reinforcement learning (RL) excels in various control tasks, yet the absence of safety guarantees hampers its real-world applicability. In particular, exploration during learning usually results in safety violations, since the RL agent learns from those mistakes. On the other hand, safe control techniques ensure persistent safety satisfaction but demand strong priors on system dynamics, which are usually hard to obtain in practice. To address these problems, we present Safe Set Guided State-wise Constrained Policy Optimization (S-3PO), a pioneering algorithm that generates state-wise safe optimal policies with zero training violations, i.e., learning without mistakes. S-3PO first employs a safety-oriented monitor with black-box dynamics to ensure safe exploration. It then imposes an "imaginary" cost so that the RL agent converges to optimal behaviors within safety constraints. S-3PO outperforms existing methods in high-dimensional robotics tasks, managing state-wise constraints with zero training violations. This innovation marks a significant stride towards real-world safe RL deployment.
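To make the two-stage idea in the abstract concrete, the Python sketch below illustrates one way a safety monitor could filter every proposed action through a safe set and convert the applied correction into an "imaginary" cost used as a state-wise constraint signal. This is a schematic reading of the abstract only, not the paper's implementation; all names (`safety_monitor`, `phi`, `collect_rollout`, the Gym-style environment interface) are assumptions for illustration.

```python
import numpy as np

# Hypothetical sketch of safe-set-guided exploration with an "imaginary" cost.
# Actions are assumed to be numpy arrays; phi(state, action) is an assumed
# safety-index function where phi <= 0 is treated as safe.

def safety_monitor(state, proposed_action, candidate_actions, phi):
    """Project a proposed action onto the safe set and report the correction.

    Returns the executed (safe) action and an imaginary cost measuring how much
    the monitor had to intervene, even though no real violation ever occurs.
    """
    if phi(state, proposed_action) <= 0.0:
        return proposed_action, 0.0
    safe = [a for a in candidate_actions if phi(state, a) <= 0.0]
    if not safe:
        # Fallback: pick the least-unsafe candidate (illustrative choice only).
        safe = [min(candidate_actions, key=lambda a: phi(state, a))]
    corrected = min(safe, key=lambda a: np.linalg.norm(a - proposed_action))
    imaginary_cost = float(np.linalg.norm(corrected - proposed_action))
    return corrected, imaginary_cost


def collect_rollout(env, policy, phi, sample_actions, horizon=1000):
    """Roll out a policy; the environment only ever receives monitored actions."""
    state = env.reset()
    trajectory = []
    for _ in range(horizon):
        proposed = policy(state)
        executed, c_imag = safety_monitor(state, proposed, sample_actions(state), phi)
        next_state, reward, done, _ = env.step(executed)
        # The policy would later be optimized for reward subject to a
        # state-wise constraint on c_imag, driving interventions to zero.
        trajectory.append((state, proposed, executed, reward, c_imag))
        state = next_state
        if done:
            break
    return trajectory
```

In this reading, the monitor guarantees that training itself is violation-free, while the imaginary cost gives the learner a gradient signal toward behaviors that no longer need correction; the specific projection rule and constraint formulation above are placeholders for the mechanisms described in the paper.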

Cite this Paper


BibTeX
@InProceedings{pmlr-v283-sun25a,
  title     = {Learn With Imagination: Safe Set Guided State-wise Constrained Policy Optimization},
  author    = {Sun, Yifan and Li, Feihan and Zhao, Weiye and Chen, Rui and Wei, Tianhao and Liu, Changliu},
  booktitle = {Proceedings of the 7th Annual Learning for Dynamics \& Control Conference},
  pages     = {298--309},
  year      = {2025},
  editor    = {Ozay, Necmiye and Balzano, Laura and Panagou, Dimitra and Abate, Alessandro},
  volume    = {283},
  series    = {Proceedings of Machine Learning Research},
  month     = {04--06 Jun},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v283/main/assets/sun25a/sun25a.pdf},
  url       = {https://proceedings.mlr.press/v283/sun25a.html},
  abstract  = {Deep reinforcement learning (RL) excels in various control tasks, yet the absence of safety guarantees hampers its real-world applicability. In particular, exploration during learning usually results in safety violations, since the RL agent learns from those mistakes. On the other hand, safe control techniques ensure persistent safety satisfaction but demand strong priors on system dynamics, which are usually hard to obtain in practice. To address these problems, we present Safe Set Guided State-wise Constrained Policy Optimization (S-3PO), a pioneering algorithm that generates state-wise safe optimal policies with zero training violations, i.e., learning without mistakes. S-3PO first employs a safety-oriented monitor with black-box dynamics to ensure safe exploration. It then imposes an "imaginary" cost so that the RL agent converges to optimal behaviors within safety constraints. S-3PO outperforms existing methods in high-dimensional robotics tasks, managing state-wise constraints with zero training violations. This innovation marks a significant stride towards real-world safe RL deployment.}
}
Endnote
%0 Conference Paper
%T Learn With Imagination: Safe Set Guided State-wise Constrained Policy Optimization
%A Yifan Sun
%A Feihan Li
%A Weiye Zhao
%A Rui Chen
%A Tianhao Wei
%A Changliu Liu
%B Proceedings of the 7th Annual Learning for Dynamics & Control Conference
%C Proceedings of Machine Learning Research
%D 2025
%E Necmiye Ozay
%E Laura Balzano
%E Dimitra Panagou
%E Alessandro Abate
%F pmlr-v283-sun25a
%I PMLR
%P 298--309
%U https://proceedings.mlr.press/v283/sun25a.html
%V 283
%X Deep reinforcement learning (RL) excels in various control tasks, yet the absence of safety guarantees hampers its real-world applicability. In particular, exploration during learning usually results in safety violations, since the RL agent learns from those mistakes. On the other hand, safe control techniques ensure persistent safety satisfaction but demand strong priors on system dynamics, which are usually hard to obtain in practice. To address these problems, we present Safe Set Guided State-wise Constrained Policy Optimization (S-3PO), a pioneering algorithm that generates state-wise safe optimal policies with zero training violations, i.e., learning without mistakes. S-3PO first employs a safety-oriented monitor with black-box dynamics to ensure safe exploration. It then imposes an "imaginary" cost so that the RL agent converges to optimal behaviors within safety constraints. S-3PO outperforms existing methods in high-dimensional robotics tasks, managing state-wise constraints with zero training violations. This innovation marks a significant stride towards real-world safe RL deployment.
APA
Sun, Y., Li, F., Zhao, W., Chen, R., Wei, T. & Liu, C. (2025). Learn With Imagination: Safe Set Guided State-wise Constrained Policy Optimization. Proceedings of the 7th Annual Learning for Dynamics & Control Conference, in Proceedings of Machine Learning Research 283:298-309. Available from https://proceedings.mlr.press/v283/sun25a.html.
