Model-Based Reinforcement Learning via Meta-Policy Optimization

Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, Pieter Abbeel
; Proceedings of The 2nd Conference on Robot Learning, PMLR 87:617-629, 2018.

Abstract

Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamic models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing consistent dynamics predictions among the ensemble while shifting the burden of behaving optimally w.r.t. the model discrepancies towards the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience.

Cite this Paper


BibTeX
@InProceedings{pmlr-v87-clavera18a, title = {Model-Based Reinforcement Learning via Meta-Policy Optimization}, author = {Clavera, Ignasi and Rothfuss, Jonas and Schulman, John and Fujita, Yasuhiro and Asfour, Tamim and Abbeel, Pieter}, booktitle = {Proceedings of The 2nd Conference on Robot Learning}, pages = {617--629}, year = {2018}, editor = {Aude Billard and Anca Dragan and Jan Peters and Jun Morimoto}, volume = {87}, series = {Proceedings of Machine Learning Research}, address = {}, month = {29--31 Oct}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v87/clavera18a/clavera18a.pdf}, url = {http://proceedings.mlr.press/v87/clavera18a.html}, abstract = {Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamic models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing consistent dynamics predictions among the ensemble while shifting the burden of behaving optimally w.r.t. the model discrepancies towards the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience. } }
Endnote
%0 Conference Paper %T Model-Based Reinforcement Learning via Meta-Policy Optimization %A Ignasi Clavera %A Jonas Rothfuss %A John Schulman %A Yasuhiro Fujita %A Tamim Asfour %A Pieter Abbeel %B Proceedings of The 2nd Conference on Robot Learning %C Proceedings of Machine Learning Research %D 2018 %E Aude Billard %E Anca Dragan %E Jan Peters %E Jun Morimoto %F pmlr-v87-clavera18a %I PMLR %J Proceedings of Machine Learning Research %P 617--629 %U http://proceedings.mlr.press %V 87 %W PMLR %X Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamic models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing consistent dynamics predictions among the ensemble while shifting the burden of behaving optimally w.r.t. the model discrepancies towards the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience.
APA
Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T. & Abbeel, P.. (2018). Model-Based Reinforcement Learning via Meta-Policy Optimization. Proceedings of The 2nd Conference on Robot Learning, in PMLR 87:617-629

Related Material