Optimistic Policy Optimization via Multiple Importance Sampling

Matteo Papini, Alberto Maria Metelli, Lorenzo Lupo, Marcello Restelli
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:4989-4999, 2019.

Abstract

Policy Search (PS) is an effective approach to Reinforcement Learning (RL) for solving control tasks with continuous state-action spaces. In this paper, we address the exploration-exploitation trade-off in PS by proposing an approach based on Optimism in the Face of Uncertainty. We cast the PS problem as a suitable Multi Armed Bandit (MAB) problem, defined over the policy parameter space, and we propose a class of algorithms that effectively exploit the problem structure, by leveraging Multiple Importance Sampling to perform an off-policy estimation of the expected return. We show that the regret of the proposed approach is bounded by $\widetilde{\mathcal{O}}(\sqrt{T})$ for both discrete and continuous parameter spaces. Finally, we evaluate our algorithms on tasks of varying difficulty, comparing them with existing MAB and RL algorithms.
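To make the idea in the abstract concrete, below is a minimal, self-contained sketch (not the authors' algorithm) of an optimistic search over a discrete set of policy parameters, where the expected return of each candidate is estimated off-policy with a balance-heuristic Multiple Importance Sampling estimator and combined with a generic exploration bonus based on the effective sample size. All names, the one-dimensional Gaussian policy, the toy reward, and the bonus form are illustrative assumptions, not the paper's implementation or its exact confidence index.

    # Illustrative sketch only: optimistic policy optimization over a discrete
    # parameter grid, using balance-heuristic MIS for off-policy return estimation.
    import numpy as np

    rng = np.random.default_rng(0)

    def gaussian_pdf(x, mean, std=1.0):
        return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

    def mis_estimate(theta, thetas_behav, actions, returns):
        """Balance-heuristic MIS estimate of the expected return under `theta`.

        thetas_behav: behavioural parameter used to collect each sample.
        Returns (estimate, effective sample size of the importance weights).
        """
        target = gaussian_pdf(actions, theta)
        # Balance heuristic: the denominator is the mixture (1/N) * sum_j p_{theta_j}(x),
        # obtained by averaging the behavioural densities of all collected samples.
        mixture = np.mean([gaussian_pdf(actions, tb) for tb in thetas_behav], axis=0)
        weights = target / np.maximum(mixture, 1e-12)
        estimate = float(np.mean(weights * returns))
        ess = float(np.sum(weights) ** 2 / np.maximum(np.sum(weights ** 2), 1e-12))
        return estimate, ess

    def reward(action):
        # Toy objective: best actions are near 1.5 (unknown to the learner).
        return -(action - 1.5) ** 2 + rng.normal(scale=0.1)

    candidate_thetas = np.linspace(-2.0, 2.0, 9)   # discrete parameter space
    thetas_behav, actions, returns = [], [], []

    for t in range(1, 201):
        if t <= len(candidate_thetas):
            theta = candidate_thetas[t - 1]        # try each candidate once
        else:
            # Optimism in the Face of Uncertainty: pick the candidate maximising
            # MIS estimate plus a (generic) bonus that shrinks as the candidate's
            # effective sample size grows.
            acts, rets = np.array(actions), np.array(returns)
            scores = []
            for th in candidate_thetas:
                est, ess = mis_estimate(th, thetas_behav, acts, rets)
                scores.append(est + np.sqrt(2 * np.log(t) / ess))
            theta = candidate_thetas[int(np.argmax(scores))]

        a = rng.normal(loc=theta, scale=1.0)       # act with the chosen policy
        thetas_behav.append(theta); actions.append(a); returns.append(reward(a))

    print("preferred parameter:", theta)

Candidates whose policy is far from every behavioural policy receive small effective sample sizes and hence large bonuses, so the loop keeps exploring them until the off-policy estimates become reliable; this is the structural point the sketch is meant to illustrate.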

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-papini19a,
  title     = {Optimistic Policy Optimization via Multiple Importance Sampling},
  author    = {Papini, Matteo and Metelli, Alberto Maria and Lupo, Lorenzo and Restelli, Marcello},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {4989--4999},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/papini19a/papini19a.pdf},
  url       = {https://proceedings.mlr.press/v97/papini19a.html},
  abstract  = {Policy Search (PS) is an effective approach to Reinforcement Learning (RL) for solving control tasks with continuous state-action spaces. In this paper, we address the exploration-exploitation trade-off in PS by proposing an approach based on Optimism in the Face of Uncertainty. We cast the PS problem as a suitable Multi Armed Bandit (MAB) problem, defined over the policy parameter space, and we propose a class of algorithms that effectively exploit the problem structure, by leveraging Multiple Importance Sampling to perform an off-policy estimation of the expected return. We show that the regret of the proposed approach is bounded by $\widetilde{\mathcal{O}}(\sqrt{T})$ for both discrete and continuous parameter spaces. Finally, we evaluate our algorithms on tasks of varying difficulty, comparing them with existing MAB and RL algorithms.}
}
Endnote
%0 Conference Paper
%T Optimistic Policy Optimization via Multiple Importance Sampling
%A Matteo Papini
%A Alberto Maria Metelli
%A Lorenzo Lupo
%A Marcello Restelli
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-papini19a
%I PMLR
%P 4989--4999
%U https://proceedings.mlr.press/v97/papini19a.html
%V 97
%X Policy Search (PS) is an effective approach to Reinforcement Learning (RL) for solving control tasks with continuous state-action spaces. In this paper, we address the exploration-exploitation trade-off in PS by proposing an approach based on Optimism in the Face of Uncertainty. We cast the PS problem as a suitable Multi Armed Bandit (MAB) problem, defined over the policy parameter space, and we propose a class of algorithms that effectively exploit the problem structure, by leveraging Multiple Importance Sampling to perform an off-policy estimation of the expected return. We show that the regret of the proposed approach is bounded by $\widetilde{\mathcal{O}}(\sqrt{T})$ for both discrete and continuous parameter spaces. Finally, we evaluate our algorithms on tasks of varying difficulty, comparing them with existing MAB and RL algorithms.
APA
Papini, M., Metelli, A.M., Lupo, L. & Restelli, M. (2019). Optimistic Policy Optimization via Multiple Importance Sampling. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:4989-4999. Available from https://proceedings.mlr.press/v97/papini19a.html.