Safe Policy Learning for Continuous Control

Yinlam Chow; Ofir Nachum; Aleksandra Faust; Edgar Dueñez-Guzman; Mohammad Ghavamzadeh

Safe Policy Learning for Continuous Control

Yinlam Chow, Ofir Nachum, Aleksandra Faust, Edgar Dueñez-Guzman, Mohammad Ghavamzadeh

Proceedings of the 2020 Conference on Robot Learning, PMLR 155:801-821, 2021.

Abstract

We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through near-safe policies, i.e., policies that keep the agent in desirable situations, both during training and at convergence. We formulate these problems as {\em constrained} Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while enforcing near-constraint satisfaction for every policy update by projecting either the policy parameter or the selected action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, in practice our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world robot obstacle-avoidance problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction.

Cite this Paper

BibTeX


@InProceedings{pmlr-v155-chow21a,
  title = 	 {Safe Policy Learning for Continuous Control},
  author =       {Chow, Yinlam and Nachum, Ofir and Faust, Aleksandra and Due\~{n}ez-Guzman, Edgar and Ghavamzadeh, Mohammad},
  booktitle = 	 {Proceedings of the 2020 Conference on Robot Learning},
  pages = 	 {801--821},
  year = 	 {2021},
  editor = 	 {Kober, Jens and Ramos, Fabio and Tomlin, Claire},
  volume = 	 {155},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {16--18 Nov},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v155/chow21a/chow21a.pdf},
  url = 	 {https://proceedings.mlr.press/v155/chow21a.html},
  abstract = 	 {We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through near-safe policies, i.e., policies that keep the agent in desirable situations, both during training and at convergence. We formulate these problems as {\em constrained} Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while enforcing near-constraint satisfaction for every policy update by projecting either the policy parameter or the selected action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, in practice our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world robot obstacle-avoidance problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction.}
}

Endnote

%0 Conference Paper
%T Safe Policy Learning for Continuous Control
%A Yinlam Chow
%A Ofir Nachum
%A Aleksandra Faust
%A Edgar Dueñez-Guzman
%A Mohammad Ghavamzadeh
%B Proceedings of the 2020 Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Jens Kober
%E Fabio Ramos
%E Claire Tomlin	
%F pmlr-v155-chow21a
%I PMLR
%P 801--821
%U https://proceedings.mlr.press/v155/chow21a.html
%V 155
%X We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through near-safe policies, i.e., policies that keep the agent in desirable situations, both during training and at convergence. We formulate these problems as {\em constrained} Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while enforcing near-constraint satisfaction for every policy update by projecting either the policy parameter or the selected action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, in practice our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world robot obstacle-avoidance problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction.

APA


Chow, Y., Nachum, O., Faust, A., Dueñez-Guzman, E. & Ghavamzadeh, M.. (2021). Safe Policy Learning for Continuous Control. Proceedings of the 2020 Conference on Robot Learning, in Proceedings of Machine Learning Research 155:801-821 Available from https://proceedings.mlr.press/v155/chow21a.html.

Related Material

Download PDF