Trust Region Meta Learning for Policy Optimization

Manuel Occorso, Luca Sabbioni, Alberto Maria Metelli, Marcello Restelli
ECMLPKDD Workshop on Meta-Knowledge Transfer, PMLR 191:62-74, 2022.

Abstract

Reinforcement Learning aims to train autonomous agents through interaction with the environment by maximizing a given reward signal. In the last decade there has been an explosion of new algorithms, which make extensive use of hyper-parameters to control their behaviour, accuracy, and speed. These hyper-parameters are often fine-tuned by hand, and the selected values can drastically change the learning performance of the algorithm; furthermore, it is common to train multiple agents on very similar problems, starting from scratch each time. Our goal is to design a Meta-Reinforcement Learning algorithm to optimize the hyper-parameter of a well-known RL algorithm, Trust Region Policy Optimization. We use knowledge from previous learning sessions and another RL algorithm, Fitted-Q Iteration, to build a policy-agnostic Meta-Model capable of predicting the optimal hyper-parameter for TRPO at each of its steps on new, unseen problems, generalizing across different tasks and policy spaces.

Cite this Paper
BibTeX
@InProceedings{pmlr-v191-occorso22a,
  title     = {Trust Region Meta Learning for Policy Optimization},
  author    = {Occorso, Manuel and Sabbioni, Luca and Metelli, Alberto Maria and Restelli, Marcello},
  booktitle = {ECMLPKDD Workshop on Meta-Knowledge Transfer},
  pages     = {62--74},
  year      = {2022},
  editor    = {Brazdil, Pavel and van Rijn, Jan N. and Gouk, Henry and Mohr, Felix},
  volume    = {191},
  series    = {Proceedings of Machine Learning Research},
  month     = {23 Sep},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v191/occorso22a/occorso22a.pdf},
  url       = {https://proceedings.mlr.press/v191/occorso22a.html},
  abstract  = {Reinforcement Learning aims to train autonomous agents in their interaction with the environment by means of maximizing a given reward signal; in the last decade there has been an explosion of new algorithms, which make extensive use of hyper-parameters to control their behaviour, accuracy and speed. Often those hyper-parameters are fine-tuned by hand, and the selected values may change drastically the learning performance of the algorithm; furthermore, it happens to train multiple agents on very similar problems, starting from scratch each time. Our goal is to design a Meta-Reinforcement Learning algorithm to optimize the hyper-parameter of a well-known RL algorithm, named Trust Region Policy Optimization. We use knowledge from previous learning sessions and another RL algorithm, Fitted-Q Iteration, to build a policy-agnostic Meta-Model capable to predict the optimal hyper-parameter for TRPO at each of its steps, on new unseen problems, generalizing across different tasks and policy spaces.}
}
Endnote
%0 Conference Paper
%T Trust Region Meta Learning for Policy Optimization
%A Manuel Occorso
%A Luca Sabbioni
%A Alberto Maria Metelli
%A Marcello Restelli
%B ECMLPKDD Workshop on Meta-Knowledge Transfer
%C Proceedings of Machine Learning Research
%D 2022
%E Pavel Brazdil
%E Jan N. van Rijn
%E Henry Gouk
%E Felix Mohr
%F pmlr-v191-occorso22a
%I PMLR
%P 62--74
%U https://proceedings.mlr.press/v191/occorso22a.html
%V 191
%X Reinforcement Learning aims to train autonomous agents in their interaction with the environment by means of maximizing a given reward signal; in the last decade there has been an explosion of new algorithms, which make extensive use of hyper-parameters to control their behaviour, accuracy and speed. Often those hyper-parameters are fine-tuned by hand, and the selected values may change drastically the learning performance of the algorithm; furthermore, it happens to train multiple agents on very similar problems, starting from scratch each time. Our goal is to design a Meta-Reinforcement Learning algorithm to optimize the hyper-parameter of a well-known RL algorithm, named Trust Region Policy Optimization. We use knowledge from previous learning sessions and another RL algorithm, Fitted-Q Iteration, to build a policy-agnostic Meta-Model capable to predict the optimal hyper-parameter for TRPO at each of its steps, on new unseen problems, generalizing across different tasks and policy spaces.
APA
Occorso, M., Sabbioni, L., Metelli, A. M., &amp; Restelli, M. (2022). Trust Region Meta Learning for Policy Optimization. ECMLPKDD Workshop on Meta-Knowledge Transfer, in Proceedings of Machine Learning Research, 191:62-74. Available from https://proceedings.mlr.press/v191/occorso22a.html.

Related Material