Deterministic Policy Gradient Algorithms

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller
; Proceedings of the 31st International Conference on Machine Learning, PMLR 32(1):387-395, 2014.

Abstract

In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. Deterministic policy gradient algorithms outperformed their stochastic counterparts in several benchmark problems, particularly in high-dimensional action spaces.

Cite this Paper


BibTeX
@InProceedings{pmlr-v32-silver14, title = {Deterministic Policy Gradient Algorithms}, author = {David Silver and Guy Lever and Nicolas Heess and Thomas Degris and Daan Wierstra and Martin Riedmiller}, booktitle = {Proceedings of the 31st International Conference on Machine Learning}, pages = {387--395}, year = {2014}, editor = {Eric P. Xing and Tony Jebara}, volume = {32}, number = {1}, series = {Proceedings of Machine Learning Research}, address = {Bejing, China}, month = {22--24 Jun}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v32/silver14.pdf}, url = {http://proceedings.mlr.press/v32/silver14.html}, abstract = {In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. Deterministic policy gradient algorithms outperformed their stochastic counterparts in several benchmark problems, particularly in high-dimensional action spaces.} }
Endnote
%0 Conference Paper %T Deterministic Policy Gradient Algorithms %A David Silver %A Guy Lever %A Nicolas Heess %A Thomas Degris %A Daan Wierstra %A Martin Riedmiller %B Proceedings of the 31st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2014 %E Eric P. Xing %E Tony Jebara %F pmlr-v32-silver14 %I PMLR %J Proceedings of Machine Learning Research %P 387--395 %U http://proceedings.mlr.press %V 32 %N 1 %W PMLR %X In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. Deterministic policy gradient algorithms outperformed their stochastic counterparts in several benchmark problems, particularly in high-dimensional action spaces.
RIS
TY - CPAPER TI - Deterministic Policy Gradient Algorithms AU - David Silver AU - Guy Lever AU - Nicolas Heess AU - Thomas Degris AU - Daan Wierstra AU - Martin Riedmiller BT - Proceedings of the 31st International Conference on Machine Learning PY - 2014/01/27 DA - 2014/01/27 ED - Eric P. Xing ED - Tony Jebara ID - pmlr-v32-silver14 PB - PMLR SP - 387 DP - PMLR EP - 395 L1 - http://proceedings.mlr.press/v32/silver14.pdf UR - http://proceedings.mlr.press/v32/silver14.html AB - In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. Deterministic policy gradient algorithms outperformed their stochastic counterparts in several benchmark problems, particularly in high-dimensional action spaces. ER -
APA
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D. & Riedmiller, M.. (2014). Deterministic Policy Gradient Algorithms. Proceedings of the 31st International Conference on Machine Learning, in PMLR 32(1):387-395

Related Material