True Online TD(lambda)

Harm Seijen; Rich Sutton

True Online TD(lambda)

Harm Seijen, Rich Sutton

Proceedings of the 31st International Conference on Machine Learning, PMLR 32(1):692-700, 2014.

Abstract

TD(lambda) is a core algorithm of modern reinforcement learning. Its appeal comes from its equivalence to a clear and conceptually simple forward view, and the fact that it can be implemented online in an inexpensive manner. However, the equivalence between TD(lambda) and the forward view is exact only for the off-line version of the algorithm (in which updates are made only at the end of each episode). In the online version of TD(lambda) (in which updates are made at each step, which generally performs better and is always used in applications) the match to the forward view is only approximate. In a sense this is unavoidable for the conventional forward view, as it itself presumes that the estimates are unchanging during an episode. In this paper we introduce a new forward view that takes into account the possibility of changing estimates and a new variant of TD(lambda) that exactly achieves it. Our algorithm uses a new form of eligibility trace similar to but different from conventional accumulating and replacing traces. The overall computational complexity is the same as TD(lambda), even when using function approximation. In our empirical comparisons, our algorithm outperformed TD(lambda) in all of its variations. It seems, by adhering more truly to the original goal of TD(lambda)—matching an intuitively clear forward view even in the online case—that we have found a new algorithm that simply improves on classical TD(lambda).

Cite this Paper

BibTeX


@InProceedings{pmlr-v32-seijen14,
  title = 	 {True Online TD(lambda)},
  author = 	 {Seijen, Harm and Sutton, Rich},
  booktitle = 	 {Proceedings of the 31st International Conference on Machine Learning},
  pages = 	 {692--700},
  year = 	 {2014},
  editor = 	 {Xing, Eric P. and Jebara, Tony},
  volume = 	 {32},
  number =       {1},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Bejing, China},
  month = 	 {22--24 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v32/seijen14.pdf},
  url = 	 {https://proceedings.mlr.press/v32/seijen14.html},
  abstract = 	 {TD(lambda) is a core algorithm of modern reinforcement learning. Its appeal comes from its equivalence to a clear and conceptually simple forward view, and the fact that it can be implemented online in an inexpensive manner. However, the equivalence between TD(lambda) and the forward view is exact only for the off-line version of the algorithm (in which updates are made only at the end of each episode). In the online version of TD(lambda) (in which updates are made at each step, which generally performs better and is always used in applications) the match to the forward view is only approximate. In a sense this is unavoidable for the conventional forward view, as it itself presumes that the estimates are unchanging during an episode. In this paper we introduce a new forward view that takes into account the possibility of changing estimates and a new variant of TD(lambda) that exactly achieves it. Our algorithm uses a new form of eligibility trace similar to but different from conventional accumulating and replacing traces. The overall computational complexity is the same as TD(lambda), even when using function approximation. In our empirical comparisons, our algorithm outperformed TD(lambda) in all of its variations. It seems, by adhering more truly to the original goal of TD(lambda)—matching an intuitively clear forward view even in the online case—that we have found a new algorithm that simply improves on classical TD(lambda).}
}

Endnote

%0 Conference Paper
%T True Online TD(lambda)
%A Harm Seijen
%A Rich Sutton
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara	
%F pmlr-v32-seijen14
%I PMLR
%P 692--700
%U https://proceedings.mlr.press/v32/seijen14.html
%V 32
%N 1
%X TD(lambda) is a core algorithm of modern reinforcement learning. Its appeal comes from its equivalence to a clear and conceptually simple forward view, and the fact that it can be implemented online in an inexpensive manner. However, the equivalence between TD(lambda) and the forward view is exact only for the off-line version of the algorithm (in which updates are made only at the end of each episode). In the online version of TD(lambda) (in which updates are made at each step, which generally performs better and is always used in applications) the match to the forward view is only approximate. In a sense this is unavoidable for the conventional forward view, as it itself presumes that the estimates are unchanging during an episode. In this paper we introduce a new forward view that takes into account the possibility of changing estimates and a new variant of TD(lambda) that exactly achieves it. Our algorithm uses a new form of eligibility trace similar to but different from conventional accumulating and replacing traces. The overall computational complexity is the same as TD(lambda), even when using function approximation. In our empirical comparisons, our algorithm outperformed TD(lambda) in all of its variations. It seems, by adhering more truly to the original goal of TD(lambda)—matching an intuitively clear forward view even in the online case—that we have found a new algorithm that simply improves on classical TD(lambda).

RIS


TY  - CPAPER
TI  - True Online TD(lambda)
AU  - Harm Seijen
AU  - Rich Sutton
BT  - Proceedings of the 31st International Conference on Machine Learning
DA  - 2014/01/27
ED  - Eric P. Xing
ED  - Tony Jebara	
ID  - pmlr-v32-seijen14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 32
IS  - 1
SP  - 692
EP  - 700
L1  - http://proceedings.mlr.press/v32/seijen14.pdf
UR  - https://proceedings.mlr.press/v32/seijen14.html
AB  - TD(lambda) is a core algorithm of modern reinforcement learning. Its appeal comes from its equivalence to a clear and conceptually simple forward view, and the fact that it can be implemented online in an inexpensive manner. However, the equivalence between TD(lambda) and the forward view is exact only for the off-line version of the algorithm (in which updates are made only at the end of each episode). In the online version of TD(lambda) (in which updates are made at each step, which generally performs better and is always used in applications) the match to the forward view is only approximate. In a sense this is unavoidable for the conventional forward view, as it itself presumes that the estimates are unchanging during an episode. In this paper we introduce a new forward view that takes into account the possibility of changing estimates and a new variant of TD(lambda) that exactly achieves it. Our algorithm uses a new form of eligibility trace similar to but different from conventional accumulating and replacing traces. The overall computational complexity is the same as TD(lambda), even when using function approximation. In our empirical comparisons, our algorithm outperformed TD(lambda) in all of its variations. It seems, by adhering more truly to the original goal of TD(lambda)—matching an intuitively clear forward view even in the online case—that we have found a new algorithm that simply improves on classical TD(lambda).
ER  -

APA


Seijen, H. & Sutton, R.. (2014). True Online TD(lambda). Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(1):692-700 Available from https://proceedings.mlr.press/v32/seijen14.html.

Related Material

Download PDF