Data-efficient Hindsight Off-policy Option Learning

Markus Wulfmeier, Dushyant Rao, Roland Hafner, Thomas Lampe, Abbas Abdolmaleki, Tim Hertweck, Michael Neunert, Dhruva Tirumala, Noah Siegel, Nicolas Heess, Martin Riedmiller
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11340-11350, 2021.

Abstract

We introduce Hindsight Off-policy Options (HO2), a data-efficient option learning algorithm. Given any trajectory, HO2 infers likely option choices and backpropagates through the dynamic programming inference procedure to robustly train all policy components off-policy and end-to-end. The approach outperforms existing option learning methods on common benchmarks. To better understand the option framework and disentangle benefits from both temporal and action abstraction, we evaluate ablations with flat policies and mixture policies with comparable optimization. The results highlight the importance of both types of abstraction as well as off-policy training and trust-region constraints, particularly in challenging, simulated 3D robot manipulation tasks from raw pixel inputs. Finally, we intuitively adapt the inference step to investigate the effect of increased temporal abstraction on training with pre-trained options and from scratch.
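The dynamic-programming inference the abstract refers to is closely related to the forward algorithm for hidden Markov models: option choices are marginalised out step by step, and the resulting trajectory log-likelihood is differentiable in all policy components. Below is a minimal, hedged sketch of such a forward pass, not the authors' implementation. All names (`option_sequence_loglik`, the per-step inputs `log_pi_a`, `log_pi_high`, `log_beta`) are illustrative assumptions; the sketch assumes a shared per-step termination probability rather than per-option terminations.

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def option_sequence_loglik(log_pi_a, log_pi_high, log_beta, log_one_minus_beta):
    """Forward (dynamic-programming) pass marginalising over option choices.

    log_pi_a[t, o]        -- log prob of the observed action under option o at step t
    log_pi_high[t, o]     -- log prob of selecting option o when switching at step t
    log_beta[t]           -- log prob of terminating the previous option at step t
    log_one_minus_beta[t] -- log prob of continuing the previous option at step t
    Returns the log-likelihood of the action sequence with options summed out.
    """
    T, K = log_pi_a.shape
    # alpha[o] = log p(actions up to step t, option at step t = o)
    alpha = log_pi_high[0] + log_pi_a[0]
    for t in range(1, T):
        stay = alpha + log_one_minus_beta[t]                      # keep previous option
        switch = logsumexp(alpha) + log_beta[t] + log_pi_high[t]  # terminate and re-select
        alpha = np.logaddexp(stay, switch) + log_pi_a[t]
    return logsumexp(alpha)
```

Because every operation is differentiable, the same recursion written in an autodiff framework lets gradients flow through the inference procedure into the high-level policy, the terminations, and the low-level option policies, which is the end-to-end property the abstract describes. With the termination probability fixed to one the recursion reduces to a per-step mixture policy, which matches the mixture-policy ablation mentioned in the abstract.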

Cite this Paper
BibTeX
@InProceedings{pmlr-v139-wulfmeier21a,
  title     = {Data-efficient Hindsight Off-policy Option Learning},
  author    = {Wulfmeier, Markus and Rao, Dushyant and Hafner, Roland and Lampe, Thomas and Abdolmaleki, Abbas and Hertweck, Tim and Neunert, Michael and Tirumala, Dhruva and Siegel, Noah and Heess, Nicolas and Riedmiller, Martin},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {11340--11350},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/wulfmeier21a/wulfmeier21a.pdf},
  url       = {https://proceedings.mlr.press/v139/wulfmeier21a.html},
  abstract  = {We introduce Hindsight Off-policy Options (HO2), a data-efficient option learning algorithm. Given any trajectory, HO2 infers likely option choices and backpropagates through the dynamic programming inference procedure to robustly train all policy components off-policy and end-to-end. The approach outperforms existing option learning methods on common benchmarks. To better understand the option framework and disentangle benefits from both temporal and action abstraction, we evaluate ablations with flat policies and mixture policies with comparable optimization. The results highlight the importance of both types of abstraction as well as off-policy training and trust-region constraints, particularly in challenging, simulated 3D robot manipulation tasks from raw pixel inputs. Finally, we intuitively adapt the inference step to investigate the effect of increased temporal abstraction on training with pre-trained options and from scratch.}
}
Endnote
%0 Conference Paper
%T Data-efficient Hindsight Off-policy Option Learning
%A Markus Wulfmeier
%A Dushyant Rao
%A Roland Hafner
%A Thomas Lampe
%A Abbas Abdolmaleki
%A Tim Hertweck
%A Michael Neunert
%A Dhruva Tirumala
%A Noah Siegel
%A Nicolas Heess
%A Martin Riedmiller
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-wulfmeier21a
%I PMLR
%P 11340--11350
%U https://proceedings.mlr.press/v139/wulfmeier21a.html
%V 139
%X We introduce Hindsight Off-policy Options (HO2), a data-efficient option learning algorithm. Given any trajectory, HO2 infers likely option choices and backpropagates through the dynamic programming inference procedure to robustly train all policy components off-policy and end-to-end. The approach outperforms existing option learning methods on common benchmarks. To better understand the option framework and disentangle benefits from both temporal and action abstraction, we evaluate ablations with flat policies and mixture policies with comparable optimization. The results highlight the importance of both types of abstraction as well as off-policy training and trust-region constraints, particularly in challenging, simulated 3D robot manipulation tasks from raw pixel inputs. Finally, we intuitively adapt the inference step to investigate the effect of increased temporal abstraction on training with pre-trained options and from scratch.
APA
Wulfmeier, M., Rao, D., Hafner, R., Lampe, T., Abdolmaleki, A., Hertweck, T., Neunert, M., Tirumala, D., Siegel, N., Heess, N., & Riedmiller, M. (2021). Data-efficient Hindsight Off-policy Option Learning. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:11340-11350. Available from https://proceedings.mlr.press/v139/wulfmeier21a.html.