Actor-Critic Reinforcement Learning with Energy-Based Policies

Nicolas Heess, David Silver, Yee Whye Teh
Proceedings of the Tenth European Workshop on Reinforcement Learning, PMLR 24:45-58, 2013.

Abstract

We consider reinforcement learning in Markov decision processes with high dimensional state and action spaces. We parametrize policies using energy-based models (particularly restricted Boltzmann machines), and train them using policy gradient learning. Our approach builds upon Sallans and Hinton (2004), who parameterized value functions using energy-based models, trained using a non-linear variant of temporal-difference (TD) learning. Unfortunately, non-linear TD is known to diverge in theory and practice. We introduce the first sound and efficient algorithm for training energy-based policies, based on an actor-critic architecture. Our algorithm is computationally efficient, converges close to a local optimum, and outperforms Sallans and Hinton (2004) in several high dimensional domains.
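
To make the approach concrete, here is a minimal numerical sketch (not the authors' implementation) of an energy-based policy: a restricted Boltzmann machine whose free energy over state-action pairs defines a Boltzmann policy, updated with a score-function (policy gradient) step weighted by a critic's advantage estimate. All sizes, the one-hot candidate-action set, and the placeholder advantage are illustrative assumptions; in the paper's high-dimensional action spaces the normalizing sum cannot be enumerated and is approximated by sampling instead.

    import numpy as np

    rng = np.random.default_rng(0)
    n_s, n_a, n_h = 8, 4, 16    # illustrative state, action, and hidden-unit sizes
    params = {
        "W": rng.normal(scale=0.1, size=(n_h, n_s)),  # state-to-hidden weights
        "U": rng.normal(scale=0.1, size=(n_h, n_a)),  # action-to-hidden weights
        "b": np.zeros(n_h),                           # hidden biases
        "c": np.zeros(n_a),                           # action biases
    }

    def free_energy(p, s, a):
        # RBM free energy: F(s,a) = -c.a - sum_j softplus(b_j + (W s + U a)_j)
        return -p["c"] @ a - np.sum(np.logaddexp(0.0, p["b"] + p["W"] @ s + p["U"] @ a))

    def policy(p, s, actions):
        # Boltzmann policy pi(a|s) proportional to exp(-F(s,a)); exact only because
        # this toy action set is small enough to enumerate.
        logits = np.array([-free_energy(p, s, a) for a in actions])
        w = np.exp(logits - logits.max())
        return w / w.sum()

    def grad_log_pi(p, s, a, actions, probs):
        # grad log pi(a|s) = -dF(s,a) + E_{a' ~ pi}[dF(s,a')], by enumeration here
        def neg_dF(a_):
            h = 1.0 / (1.0 + np.exp(-(p["b"] + p["W"] @ s + p["U"] @ a_)))  # E[hidden units]
            return {"W": np.outer(h, s), "U": np.outer(h, a_), "b": h, "c": a_}
        g = neg_dF(a)
        avg = {k: sum(pr * neg_dF(a_)[k] for pr, a_ in zip(probs, actions)) for k in g}
        return {k: g[k] - avg[k] for k in g}

    # One actor update: sample an action, weight its score-function gradient by a
    # critic advantage estimate (a fixed placeholder here), and take a gradient step.
    s = rng.integers(0, 2, n_s).astype(float)
    actions = [np.eye(n_a)[i] for i in range(n_a)]   # toy one-hot candidate actions
    probs = policy(params, s, actions)
    i = rng.choice(len(actions), p=probs)
    advantage = 1.0                                   # stands in for the critic
    for k, g in grad_log_pi(params, s, actions[i], actions, probs).items():
        params[k] += 0.01 * advantage * g

In the actor-critic architecture the placeholder advantage would be replaced by a learned critic, and the expectation over actions would be estimated by Gibbs sampling from the RBM rather than computed exactly.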

Cite this Paper


BibTeX
@InProceedings{pmlr-v24-heess12a,
  title     = {Actor-Critic Reinforcement Learning with Energy-Based Policies},
  author    = {Heess, Nicolas and Silver, David and Teh, Yee Whye},
  booktitle = {Proceedings of the Tenth European Workshop on Reinforcement Learning},
  pages     = {45--58},
  year      = {2013},
  editor    = {Deisenroth, Marc Peter and Szepesvári, Csaba and Peters, Jan},
  volume    = {24},
  series    = {Proceedings of Machine Learning Research},
  address   = {Edinburgh, Scotland},
  month     = {30 Jun--01 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v24/heess12a/heess12a.pdf},
  url       = {https://proceedings.mlr.press/v24/heess12a.html},
  abstract  = {We consider reinforcement learning in Markov decision processes with high dimensional state and action spaces. We parametrize policies using energy-based models (particularly restricted Boltzmann machines), and train them using policy gradient learning. Our approach builds upon Sallans and Hinton (2004), who parameterized value functions using energy-based models, trained using a non-linear variant of temporal-difference (TD) learning. Unfortunately, non-linear TD is known to diverge in theory and practice. We introduce the first sound and efficient algorithm for training energy-based policies, based on an actor-critic architecture. Our algorithm is computationally efficient, converges close to a local optimum, and outperforms Sallans and Hinton (2004) in several high dimensional domains.}
}
APA
Heess, N., Silver, D. & Teh, Y.W. (2013). Actor-Critic Reinforcement Learning with Energy-Based Policies. Proceedings of the Tenth European Workshop on Reinforcement Learning, in Proceedings of Machine Learning Research 24:45-58. Available from https://proceedings.mlr.press/v24/heess12a.html.
