Generalized constraint for probabilistic safe reinforcement learning
Proceedings of the 6th Annual Learning for Dynamics & Control Conference, PMLR 242:1606-1618, 2024.
Abstract
In this paper, we consider the problem of learning policies for probabilistic safe reinforcement learning (PSRL). Specifically, a safe policy or controller is one that, with high probability, keeps the agent's trajectory inside a given safe set. Although an explicit gradient of the probabilistic constraint exists and PSRL can in principle be solved directly, the high variance of this gradient estimate hinders performance on problems with long horizons. An alternative frequently explored in the literature is the cumulative safe reinforcement learning (CSRL) setting. In this setting, the estimates of the constraint's gradient have lower variance but are biased, yielding worse solutions than PSRL, and they only provide an approximate answer since they solve a relaxation of the PSRL formulation. In this work, we propose a safe reinforcement learning framework with a generalized constraint for solving PSRL problems, which we term Generalized Safe Reinforcement Learning (GSRL). Our theoretical contributions establish that the proposed GSRL can recover both the PSRL and CSRL settings. In addition, it can be naturally combined with state-of-the-art safe RL algorithms such as PPO-Lagrangian, TD3-Lagrangian, CPO, and PCPO. We evaluate GSRL through a series of empirical experiments on the well-known safe RL benchmark Bullet-Safety-Gym, which show a better return-safety trade-off than both the PSRL and CSRL formulations.
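As a rough sketch of the distinction discussed above (our own illustration, not notation taken from the paper), PSRL constrains the probability that the entire trajectory stays in a safe set, whereas CSRL relaxes this to a bound on expected cumulative cost. The symbols below are illustrative assumptions: $\mathcal{S}$ is the safe set, $\delta$ a violation tolerance, $c$ a per-step cost, and $d$ a cost budget.

\[
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{P}_{\tau \sim \pi}\!\left(s_t \in \mathcal{S} \ \text{for all } t \le T\right) \ge 1 - \delta
\qquad \text{(PSRL)}
\]

\[
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} c(s_t, a_t)\right] \le d
\qquad \text{(CSRL)}
\]

The gradient of the CSRL constraint can be estimated from per-step costs along sampled trajectories, which keeps its variance low, but the expectation of a sum is only a relaxation of the trajectory-level probability bound in PSRL.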