Muesli: Combining Improvements in Policy Optimization

Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, Hado Van Hasselt
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:4214-4226, 2021.

Abstract

We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero’s state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.

Cite this Paper

BibTeX
@InProceedings{pmlr-v139-hessel21a,
  title     = {Muesli: Combining Improvements in Policy Optimization},
  author    = {Hessel, Matteo and Danihelka, Ivo and Viola, Fabio and Guez, Arthur and Schmitt, Simon and Sifre, Laurent and Weber, Theophane and Silver, David and Van Hasselt, Hado},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {4214--4226},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/hessel21a/hessel21a.pdf},
  url       = {https://proceedings.mlr.press/v139/hessel21a.html},
  abstract  = {We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero’s state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.}
}
Endnote
%0 Conference Paper
%T Muesli: Combining Improvements in Policy Optimization
%A Matteo Hessel
%A Ivo Danihelka
%A Fabio Viola
%A Arthur Guez
%A Simon Schmitt
%A Laurent Sifre
%A Theophane Weber
%A David Silver
%A Hado Van Hasselt
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-hessel21a
%I PMLR
%P 4214--4226
%U https://proceedings.mlr.press/v139/hessel21a.html
%V 139
%X We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero’s state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.
APA
Hessel, M., Danihelka, I., Viola, F., Guez, A., Schmitt, S., Sifre, L., Weber, T., Silver, D., & Van Hasselt, H. (2021). Muesli: Combining Improvements in Policy Optimization. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:4214-4226. Available from https://proceedings.mlr.press/v139/hessel21a.html.