Loosely consistent emphatic temporal-difference learning

Jiamin He; Fengdi Che; Yi Wan; A. Rupam Mahmood

Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR 216:849-859, 2023.

Abstract

There has been significant interest in searching for off-policy Temporal-Difference (TD) algorithms that find the same solution that would have been obtained in the on-policy regime. An important property of such algorithms is that their expected update has the same fixed point as that of On-policy TD($\lambda$), which we call loose consistency. Notably, Full-IS-TD($\lambda$) is the only existing loosely consistent method under general linear function approximation but, unfortunately, has a high variance and is scarcely practical. This notorious high variance issue motivates the introduction of ETD($\lambda$), which tames down the variance but has a biased fixed point. Inspired by these two methods, we propose a new loosely consistent algorithm called Average Emphatic TD (AETD($\lambda$)) with a transient bias, which strikes a balance between bias and variance. Further, we unify AETD($\lambda$) with existing methods and obtain a new family of loosely consistent algorithms called Loosely Consistent Emphatic TD (LC-ETD($\lambda$, $\beta$, $\nu$)), which can control a smooth bias-variance trade-off by varying the speed at which the transient bias fades. Through experiments on illustrative examples, we show the effectiveness and practicality of LC-ETD($\lambda$, $\beta$, $\nu$).
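The notion of a fixed point of the expected update can be made concrete with a small sketch. This is not code from the paper and does not implement AETD or LC-ETD. It covers only the special case $\lambda = 0$ with a single scalar weight, where the expected linear TD update has the fixed point $\theta^* = b / A$, with $A = \sum_s d(s)\,x(s)\,\bigl(x(s) - \gamma \sum_{s'} P_\pi(s, s')\,x(s')\bigr)$ and $b = \sum_s d(s)\,x(s)\,r(s)$. The function name `td0_fixed_point`, the two-state chain, the features, the rewards, and both state weightings are hypothetical values chosen purely for illustration.

```python
def td0_fixed_point(d, P, r, x, gamma):
    """Fixed point theta* = b / A of the expected linear TD(0) update.

    d: state weighting, P: target-policy transition matrix,
    r: expected reward per state, x: scalar feature per state.
    All inputs are illustrative assumptions, not values from the paper.
    """
    n = len(d)
    A = 0.0
    b = 0.0
    for s in range(n):
        # Expected next-state feature under the target policy.
        expected_next_x = sum(P[s][t] * x[t] for t in range(n))
        A += d[s] * x[s] * (x[s] - gamma * expected_next_x)
        b += d[s] * x[s] * r[s]
    return b / A


# Hypothetical two-state chain induced by the target policy.
P_pi = [[0.5, 0.5], [0.5, 0.5]]
r = [0.0, 1.0]
x = [1.0, 2.0]
gamma = 0.9

d_pi = [0.5, 0.5]  # stationary distribution of P_pi (on-policy weighting)
d_mu = [0.2, 0.8]  # assumed state distribution under some behavior policy

theta_on = td0_fixed_point(d_pi, P_pi, r, x, gamma)
theta_off = td0_fixed_point(d_mu, P_pi, r, x, gamma)
print("on-policy fixed point: ", theta_on)
print("off-policy fixed point:", theta_off)
```

In this toy setting the two weightings give different fixed points. A loosely consistent method, in the abstract's sense, is one whose expected update recovers the on-policy value even though data are collected off-policy.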

Cite this Paper

BibTeX


@InProceedings{pmlr-v216-he23a,
  title = 	 {Loosely consistent emphatic temporal-difference learning},
  author =       {He, Jiamin and Che, Fengdi and Wan, Yi and Mahmood, A. Rupam},
  booktitle = 	 {Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence},
  pages = 	 {849--859},
  year = 	 {2023},
  editor = 	 {Evans, Robin J. and Shpitser, Ilya},
  volume = 	 {216},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {31 Jul--04 Aug},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v216/he23a/he23a.pdf},
  url = 	 {https://proceedings.mlr.press/v216/he23a.html},
  abstract = 	 {There has been significant interest in searching for off-policy Temporal-Difference (TD) algorithms that find the same solution that would have been obtained in the on-policy regime. An important property of such algorithms is that their expected update has the same fixed point as that of On-policy TD($\lambda$), which we call loose consistency. Notably, Full-IS-TD($\lambda$) is the only existing loosely consistent method under general linear function approximation but, unfortunately, has a high variance and is scarcely practical. This notorious high variance issue motivates the introduction of ETD($\lambda$), which tames down the variance but has a biased fixed point. Inspired by these two methods, we propose a new loosely consistent algorithm called Average Emphatic TD (AETD($\lambda$)) with a transient bias, which strikes a balance between bias and variance. Further, we unify AETD($\lambda$) with existing methods and obtain a new family of loosely consistent algorithms called Loosely Consistent Emphatic TD (LC-ETD($\lambda$, $\beta$, $\nu$)), which can control a smooth bias-variance trade-off by varying the speed at which the transient bias fades. Through experiments on illustrative examples, we show the effectiveness and practicality of LC-ETD($\lambda$, $\beta$, $\nu$).}
}

Endnote

%0 Conference Paper
%T Loosely consistent emphatic temporal-difference learning
%A Jiamin He
%A Fengdi Che
%A Yi Wan
%A A. Rupam Mahmood
%B Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2023
%E Robin J. Evans
%E Ilya Shpitser
%F pmlr-v216-he23a
%I PMLR
%P 849--859
%U https://proceedings.mlr.press/v216/he23a.html
%V 216
%X There has been significant interest in searching for off-policy Temporal-Difference (TD) algorithms that find the same solution that would have been obtained in the on-policy regime. An important property of such algorithms is that their expected update has the same fixed point as that of On-policy TD($\lambda$), which we call loose consistency. Notably, Full-IS-TD($\lambda$) is the only existing loosely consistent method under general linear function approximation but, unfortunately, has a high variance and is scarcely practical. This notorious high variance issue motivates the introduction of ETD($\lambda$), which tames down the variance but has a biased fixed point. Inspired by these two methods, we propose a new loosely consistent algorithm called Average Emphatic TD (AETD($\lambda$)) with a transient bias, which strikes a balance between bias and variance. Further, we unify AETD($\lambda$) with existing methods and obtain a new family of loosely consistent algorithms called Loosely Consistent Emphatic TD (LC-ETD($\lambda$, $\beta$, $\nu$)), which can control a smooth bias-variance trade-off by varying the speed at which the transient bias fades. Through experiments on illustrative examples, we show the effectiveness and practicality of LC-ETD($\lambda$, $\beta$, $\nu$).

APA


He, J., Che, F., Wan, Y. & Mahmood, A.R. (2023). Loosely consistent emphatic temporal-difference learning. Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 216:849-859. Available from https://proceedings.mlr.press/v216/he23a.html.
