Model-Based Reinforcement Learning via Meta-Policy Optimization

Ignasi Clavera; Jonas Rothfuss; John Schulman; Yasuhiro Fujita; Tamim Asfour; Pieter Abbeel

Proceedings of The 2nd Conference on Robot Learning, PMLR 87:617-629, 2018.

Abstract

Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that forgoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamics models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing consistent dynamics predictions among the ensemble while shifting the burden of behaving optimally w.r.t. the model discrepancies towards the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience.
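
The adaptation scheme described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes scalar linear models of the form s' = a * s + u in place of learned neural-network dynamics models, a one-parameter policy u = -theta * s, and finite-difference gradients in place of policy-gradient estimates from imagined rollouts. All function names, the ensemble values, and the learning rates are hypothetical.

```python
# Toy sketch only: scalar linear "dynamics models" s' = a * s + u stand in for
# learned models, and finite-difference gradients stand in for policy-gradient
# estimates. Names and constants are illustrative assumptions.

def model_return(theta, a, s0=1.0):
    """One-step return of the policy u = -theta * s under the model s' = a * s + u."""
    next_state = a * s0 - theta * s0
    return -next_state ** 2  # reward: keep the next state close to the origin


def grad(f, x, eps=1e-5):
    """Central finite-difference derivative of a scalar function."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def adapt(theta, a, inner_lr):
    """Inner step: one gradient-ascent step on a single model's return."""
    return theta + inner_lr * grad(lambda t: model_return(t, a), theta)


def meta_objective(theta, ensemble, inner_lr):
    """Average post-adaptation return across all models in the ensemble."""
    returns = [model_return(adapt(theta, a, inner_lr), a) for a in ensemble]
    return sum(returns) / len(returns)


def meta_train(ensemble, theta=0.0, inner_lr=0.1, outer_lr=0.5, steps=200):
    """Outer loop: gradient ascent on the meta-objective w.r.t. the meta-parameter."""
    for _ in range(steps):
        theta += outer_lr * grad(
            lambda t: meta_objective(t, ensemble, inner_lr), theta
        )
    return theta


ensemble = [0.8, 1.0, 1.2]  # hypothetical models that disagree about the dynamics
meta_theta = meta_train(ensemble)
adapted = [adapt(meta_theta, a, inner_lr=0.1) for a in ensemble]
```

In this toy setup the meta-parameter settles near the value the models agree on (the ensemble mean), and a single inner step moves it toward each individual model's optimum. This mirrors the division of labor the abstract describes: consistent predictions are absorbed by the meta-policy, and model discrepancies are handled by the adaptation step.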

Cite this Paper

BibTeX


@InProceedings{pmlr-v87-clavera18a,
  title = {Model-Based Reinforcement Learning via Meta-Policy Optimization},
  author = {Clavera, Ignasi and Rothfuss, Jonas and Schulman, John and Fujita, Yasuhiro and Asfour, Tamim and Abbeel, Pieter},
  booktitle = {Proceedings of The 2nd Conference on Robot Learning},
  pages = {617--629},
  year = {2018},
  editor = {Billard, Aude and Dragan, Anca and Peters, Jan and Morimoto, Jun},
  volume = {87},
  series = {Proceedings of Machine Learning Research},
  month = {29--31 Oct},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v87/clavera18a/clavera18a.pdf},
  url = {https://proceedings.mlr.press/v87/clavera18a.html},
  abstract = {Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that forgoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamics models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing consistent dynamics predictions among the ensemble while shifting the burden of behaving optimally w.r.t. the model discrepancies towards the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience.}
}

Endnote

%0 Conference Paper
%T Model-Based Reinforcement Learning via Meta-Policy Optimization
%A Ignasi Clavera
%A Jonas Rothfuss
%A John Schulman
%A Yasuhiro Fujita
%A Tamim Asfour
%A Pieter Abbeel
%B Proceedings of The 2nd Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Aude Billard
%E Anca Dragan
%E Jan Peters
%E Jun Morimoto
%F pmlr-v87-clavera18a
%I PMLR
%P 617--629
%U https://proceedings.mlr.press/v87/clavera18a.html
%V 87
%X Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that forgoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamics models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing consistent dynamics predictions among the ensemble while shifting the burden of behaving optimally w.r.t. the model discrepancies towards the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience.

APA


Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T. & Abbeel, P. (2018). Model-Based Reinforcement Learning via Meta-Policy Optimization. Proceedings of The 2nd Conference on Robot Learning, in Proceedings of Machine Learning Research 87:617-629. Available from https://proceedings.mlr.press/v87/clavera18a.html.
