Trust Region Meta Learning for Policy Optimization

Manuel Occorso; Luca Sabbioni; Alberto Maria Metelli; Marcello Restelli

ECMLPKDD Workshop on Meta-Knowledge Transfer, PMLR 191:62-74, 2022.

Abstract

Reinforcement Learning aims to train autonomous agents through interaction with the environment by maximizing a given reward signal. In the last decade there has been an explosion of new algorithms, which make extensive use of hyper-parameters to control their behaviour, accuracy and speed. Often these hyper-parameters are fine-tuned by hand, and the selected values may drastically change the learning performance of the algorithm; furthermore, multiple agents may be trained on very similar problems, starting from scratch each time. Our goal is to design a Meta-Reinforcement Learning algorithm to optimize the hyper-parameter of a well-known RL algorithm, Trust Region Policy Optimization (TRPO). We use knowledge from previous learning sessions and another RL algorithm, Fitted-Q Iteration, to build a policy-agnostic Meta-Model capable of predicting the optimal hyper-parameter for TRPO at each of its steps, on new unseen problems, generalizing across different tasks and policy spaces.

Cite this Paper

BibTeX


@InProceedings{pmlr-v191-occorso22a,
  title = 	 {Trust Region Meta Learning for Policy Optimization},
  author =       {Occorso, Manuel and Sabbioni, Luca and Metelli, Alberto Maria and Restelli, Marcello},
  booktitle = 	 {ECMLPKDD Workshop on Meta-Knowledge Transfer},
  pages = 	 {62--74},
  year = 	 {2022},
  editor = 	 {Brazdil, Pavel and van Rijn, Jan N. and Gouk, Henry and Mohr, Felix},
  volume = 	 {191},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23 Sep},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v191/occorso22a/occorso22a.pdf},
  url = 	 {https://proceedings.mlr.press/v191/occorso22a.html},
  abstract = 	 {Reinforcement Learning aims to train autonomous agents in their interaction with the environment by means of maximizing a given reward signal; in the last decade there has been an explosion of new algorithms, which make extensive use of hyper-parameters to control their behaviour, accuracy and speed. Often those hyper-parameters are fine-tuned by hand, and the selected values may change drastically the learning performance of the algorithm; furthermore, it happens to train multiple agents on very similar problems, starting from scratch each time. Our goal is to design a Meta-Reinforcement Learning algorithm to optimize the hyper-parameter of a well-known RL algorithm, named Trust Region Policy Optimization. We use knowledge from previous learning sessions and another RL algorithm, Fitted-Q Iteration, to build a policy-agnostic Meta-Model capable to predict the optimal hyper-parameter for TRPO at each of its steps, on new unseen problems, generalizing across different tasks and policy spaces.}
}

Endnote

%0 Conference Paper
%T Trust Region Meta Learning for Policy Optimization
%A Manuel Occorso
%A Luca Sabbioni
%A Alberto Maria Metelli
%A Marcello Restelli
%B ECMLPKDD Workshop on Meta-Knowledge Transfer
%C Proceedings of Machine Learning Research
%D 2022
%E Pavel Brazdil
%E Jan N. van Rijn
%E Henry Gouk
%E Felix Mohr
%F pmlr-v191-occorso22a
%I PMLR
%P 62--74
%U https://proceedings.mlr.press/v191/occorso22a.html
%V 191
%X Reinforcement Learning aims to train autonomous agents in their interaction with the environment by means of maximizing a given reward signal; in the last decade there has been an explosion of new algorithms, which make extensive use of hyper-parameters to control their behaviour, accuracy and speed. Often those hyper-parameters are fine-tuned by hand, and the selected values may change drastically the learning performance of the algorithm; furthermore, it happens to train multiple agents on very similar problems, starting from scratch each time. Our goal is to design a Meta-Reinforcement Learning algorithm to optimize the hyper-parameter of a well-known RL algorithm, named Trust Region Policy Optimization. We use knowledge from previous learning sessions and another RL algorithm, Fitted-Q Iteration, to build a policy-agnostic Meta-Model capable to predict the optimal hyper-parameter for TRPO at each of its steps, on new unseen problems, generalizing across different tasks and policy spaces.

APA


Occorso, M., Sabbioni, L., Metelli, A.M. & Restelli, M. (2022). Trust Region Meta Learning for Policy Optimization. ECMLPKDD Workshop on Meta-Knowledge Transfer, in Proceedings of Machine Learning Research 191:62-74. Available from https://proceedings.mlr.press/v191/occorso22a.html.
