Muesli: Combining Improvements in Policy Optimization

Matteo Hessel; Ivo Danihelka; Fabio Viola; Arthur Guez; Simon Schmitt; Laurent Sifre; Theophane Weber; David Silver; Hado Van Hasselt

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:4214-4226, 2021.

Abstract

We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero’s state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-hessel21a,
  title     = {Muesli: Combining Improvements in Policy Optimization},
  author    = {Hessel, Matteo and Danihelka, Ivo and Viola, Fabio and Guez, Arthur and Schmitt, Simon and Sifre, Laurent and Weber, Theophane and Silver, David and Van Hasselt, Hado},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {4214--4226},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/hessel21a/hessel21a.pdf},
  url       = {https://proceedings.mlr.press/v139/hessel21a.html},
  abstract  = {We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero’s state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.}
}

Endnote

%0 Conference Paper
%T Muesli: Combining Improvements in Policy Optimization
%A Matteo Hessel
%A Ivo Danihelka
%A Fabio Viola
%A Arthur Guez
%A Simon Schmitt
%A Laurent Sifre
%A Theophane Weber
%A David Silver
%A Hado Van Hasselt
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-hessel21a
%I PMLR
%P 4214--4226
%U https://proceedings.mlr.press/v139/hessel21a.html
%V 139
%X We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero’s state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.

APA

Hessel, M., Danihelka, I., Viola, F., Guez, A., Schmitt, S., Sifre, L., Weber, T., Silver, D., & Van Hasselt, H. (2021). Muesli: Combining Improvements in Policy Optimization. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:4214-4226. Available from https://proceedings.mlr.press/v139/hessel21a.html.
