Bias in Natural Actor-Critic Algorithms

Philip Thomas

Proceedings of the 31st International Conference on Machine Learning, PMLR 32(1):441-448, 2014.

Abstract

We show that several popular discounted reward natural actor-critics, including the popular NAC-LSTD and eNAC algorithms, do not generate unbiased estimates of the natural policy gradient as claimed. We derive the first unbiased discounted reward natural actor-critics using batch and iterative approaches to gradient estimation. We argue that the bias makes the existing algorithms more appropriate for the average reward setting. We also show that, when Sarsa(lambda) is guaranteed to converge to an optimal policy, the objective function used by natural actor-critics is concave, so policy gradient methods are guaranteed to converge to globally optimal policies as well.

Cite this Paper

BibTeX


@InProceedings{pmlr-v32-thomas14,
  title = 	 {Bias in Natural Actor-Critic Algorithms},
  author = 	 {Thomas, Philip},
  booktitle = 	 {Proceedings of the 31st International Conference on Machine Learning},
  pages = 	 {441--448},
  year = 	 {2014},
  editor = 	 {Xing, Eric P. and Jebara, Tony},
  volume = 	 {32},
  number =       {1},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Beijing, China},
  month = 	 {22--24 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v32/thomas14.pdf},
  url = 	 {https://proceedings.mlr.press/v32/thomas14.html},
  abstract = 	 {We show that several popular discounted reward natural actor-critics, including the popular NAC-LSTD and eNAC algorithms, do not generate unbiased estimates of the natural policy gradient as claimed. We derive the first unbiased discounted reward natural actor-critics using batch and iterative approaches to gradient estimation. We argue that the bias makes the existing algorithms more appropriate for the average reward setting. We also show that, when Sarsa(lambda) is guaranteed to converge to an optimal policy, the objective function used by natural actor-critics is concave, so policy gradient methods are guaranteed to converge to globally optimal policies as well.}
}

Endnote

%0 Conference Paper
%T Bias in Natural Actor-Critic Algorithms
%A Philip Thomas
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara
%F pmlr-v32-thomas14
%I PMLR
%P 441--448
%U https://proceedings.mlr.press/v32/thomas14.html
%V 32
%N 1
%X We show that several popular discounted reward natural actor-critics, including the popular NAC-LSTD and eNAC algorithms, do not generate unbiased estimates of the natural policy gradient as claimed. We derive the first unbiased discounted reward natural actor-critics using batch and iterative approaches to gradient estimation. We argue that the bias makes the existing algorithms more appropriate for the average reward setting. We also show that, when Sarsa(lambda) is guaranteed to converge to an optimal policy, the objective function used by natural actor-critics is concave, so policy gradient methods are guaranteed to converge to globally optimal policies as well.

RIS


TY  - CPAPER
TI  - Bias in Natural Actor-Critic Algorithms
AU  - Philip Thomas
BT  - Proceedings of the 31st International Conference on Machine Learning
DA  - 2014/01/27
ED  - Eric P. Xing
ED  - Tony Jebara
ID  - pmlr-v32-thomas14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 32
IS  - 1
SP  - 441
EP  - 448
L1  - http://proceedings.mlr.press/v32/thomas14.pdf
UR  - https://proceedings.mlr.press/v32/thomas14.html
AB  - We show that several popular discounted reward natural actor-critics, including the popular NAC-LSTD and eNAC algorithms, do not generate unbiased estimates of the natural policy gradient as claimed. We derive the first unbiased discounted reward natural actor-critics using batch and iterative approaches to gradient estimation. We argue that the bias makes the existing algorithms more appropriate for the average reward setting. We also show that, when Sarsa(lambda) is guaranteed to converge to an optimal policy, the objective function used by natural actor-critics is concave, so policy gradient methods are guaranteed to converge to globally optimal policies as well.
ER  -

APA


Thomas, P. (2014). Bias in Natural Actor-Critic Algorithms. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(1):441-448. Available from https://proceedings.mlr.press/v32/thomas14.html.
