Balancing Learning Speed and Stability in Policy Gradient via Adaptive Exploration

Matteo Papini, Andrea Battistello, Marcello Restelli
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:1188-1199, 2020.

Abstract

In many Reinforcement Learning (RL) applications, the goal is to find an optimal deterministic policy. However, most RL algorithms require the policy to be stochastic in order to avoid instabilities and perform a sufficient amount of exploration. Adjusting the level of stochasticity during the learning process is non-trivial, as it is difficult to assess whether the costs of random exploration will be repaid in the long run, and to contain the risk of instability. We study this problem in the context of policy gradients (PG) with Gaussian policies. Using tools from the safe PG literature, we design a surrogate objective for the policy variance that captures the effects this parameter has on the learning speed and on the quality of the final solution. Furthermore, we provide a way to optimize this objective that guarantees stable improvement of the original performance measure. We evaluate the proposed methods on simulated continuous control tasks.
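
For intuition only, the sketch below (not the paper's algorithm) shows where the exploration parameter enters a Gaussian policy gradient: a REINFORCE-style update for a one-dimensional policy pi(a|s) = N(theta*s, sigma^2) on a hypothetical toy linear-quadratic task, in which both the mean weight theta and the exploration standard deviation sigma are updated along their vanilla score-function gradients. The paper's surrogate objective and safe update rule for the variance are not reproduced here; all names and constants are illustrative assumptions.

# Minimal sketch (assumptions noted above): vanilla policy gradient for a
# 1-D Gaussian policy pi(a|s) = N(theta*s, sigma^2) on a toy linear system
# with quadratic cost. The exploration std sigma is updated along its plain
# gradient, NOT via the paper's surrogate objective.
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, sigma, horizon=20):
    """Collect one trajectory on a toy 1-D linear system with quadratic cost."""
    s, ret = rng.normal(), 0.0
    states, actions = [], []
    for _ in range(horizon):
        a = theta * s + sigma * rng.normal()        # sample from the Gaussian policy
        ret += -(s ** 2 + 0.1 * a ** 2)             # reward = negative quadratic cost
        states.append(s)
        actions.append(a)
        s = s + a + 0.1 * rng.normal()              # simple linear dynamics
    return np.array(states), np.array(actions), ret

theta, sigma = 0.0, 1.0                              # mean weight and exploration std
alpha_theta, alpha_sigma = 1e-3, 1e-4                # separate step sizes (arbitrary)

for it in range(200):
    grad_theta, grad_sigma, avg_ret = 0.0, 0.0, 0.0
    batch = 20
    for _ in range(batch):
        s, a, ret = rollout(theta, sigma)
        # Score functions of N(theta*s, sigma^2) w.r.t. theta and sigma
        grad_theta += ret * np.sum((a - theta * s) * s) / sigma ** 2
        grad_sigma += ret * np.sum((a - theta * s) ** 2 - sigma ** 2) / sigma ** 3
        avg_ret += ret
    theta += alpha_theta * grad_theta / batch
    sigma = max(1e-2, sigma + alpha_sigma * grad_sigma / batch)  # keep sigma positive
    if it % 50 == 0:
        print(f"iter {it:3d}  return {avg_ret / batch:8.2f}  theta {theta:+.3f}  sigma {sigma:.3f}")

The point of the sketch is that sigma appears in the score function itself, so the same gradient signal that improves the mean also tells us whether more or less exploration is currently worthwhile; the paper's contribution is a principled, stability-preserving way to act on that signal.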

Cite this Paper


BibTeX
@InProceedings{pmlr-v108-papini20a,
  title     = {Balancing Learning Speed and Stability in Policy Gradient via Adaptive Exploration},
  author    = {Papini, Matteo and Battistello, Andrea and Restelli, Marcello},
  booktitle = {Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics},
  pages     = {1188--1199},
  year      = {2020},
  editor    = {Chiappa, Silvia and Calandra, Roberto},
  volume    = {108},
  series    = {Proceedings of Machine Learning Research},
  month     = {26--28 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v108/papini20a/papini20a.pdf},
  url       = {https://proceedings.mlr.press/v108/papini20a.html},
  abstract  = {In many Reinforcement Learning (RL) applications, the goal is to find an optimal deterministic policy. However, most RL algorithms require the policy to be stochastic in order to avoid instabilities and perform a sufficient amount of exploration. Adjusting the level of stochasticity during the learning process is non-trivial, as it is difficult to assess whether the costs of random exploration will be repaid in the long run, and to contain the risk of instability. We study this problem in the context of policy gradients (PG) with Gaussian policies. Using tools from the safe PG literature, we design a surrogate objective for the policy variance that captures the effects this parameter has on the learning speed and on the quality of the final solution. Furthermore, we provide a way to optimize this objective that guarantees stable improvement of the original performance measure. We evaluate the proposed methods on simulated continuous control tasks.}
}
Endnote
%0 Conference Paper
%T Balancing Learning Speed and Stability in Policy Gradient via Adaptive Exploration
%A Matteo Papini
%A Andrea Battistello
%A Marcello Restelli
%B Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2020
%E Silvia Chiappa
%E Roberto Calandra
%F pmlr-v108-papini20a
%I PMLR
%P 1188--1199
%U https://proceedings.mlr.press/v108/papini20a.html
%V 108
%X In many Reinforcement Learning (RL) applications, the goal is to find an optimal deterministic policy. However, most RL algorithms require the policy to be stochastic in order to avoid instabilities and perform a sufficient amount of exploration. Adjusting the level of stochasticity during the learning process is non-trivial, as it is difficult to assess whether the costs of random exploration will be repaid in the long run, and to contain the risk of instability. We study this problem in the context of policy gradients (PG) with Gaussian policies. Using tools from the safe PG literature, we design a surrogate objective for the policy variance that captures the effects this parameter has on the learning speed and on the quality of the final solution. Furthermore, we provide a way to optimize this objective that guarantees stable improvement of the original performance measure. We evaluate the proposed methods on simulated continuous control tasks.
APA
Papini, M., Battistello, A. & Restelli, M. (2020). Balancing Learning Speed and Stability in Policy Gradient via Adaptive Exploration. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 108:1188-1199. Available from https://proceedings.mlr.press/v108/papini20a.html.
