Modified Retrace for Off-Policy Temporal Difference Learning

Xingguo Chen; Xingzhou Ma; Yang Li; Guang Yang; Shangdong Yang; Yang Gao

Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR 216:303-312, 2023.

Abstract

Off-policy learning is key to extending reinforcement learning, as it allows a target policy to be learned from data generated by a different behavior policy. However, combining off-policy learning with bootstrapping and function approximation forms what is well known as “the deadly triad”. Retrace is an efficient off-policy algorithm that employs truncated importance sampling ratios and is convergent with tabular value functions. Unfortunately, Retrace is known to be unstable with linear function approximation. In this paper, we propose modified Retrace to correct the off-policy return, derive a new off-policy temporal difference learning algorithm (TD-MRetrace) with linear function approximation, and obtain a convergence guarantee under standard assumptions. Experimental results on counterexamples and control tasks validate the effectiveness of the proposed algorithm compared with traditional algorithms.
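
As a rough illustration of the "truncated importance sampling ratios" mentioned in the abstract, the sketch below computes the per-step trace coefficients of standard Retrace, c_t = lambda * min(1, pi(a_t|s_t) / mu(a_t|s_t)). It does not show the modification proposed in the paper, which this page does not spell out. The function name, argument names, and the lam parameter are hypothetical and chosen only for this example.

```python
def retrace_trace_coefficients(target_probs, behavior_probs, lam=1.0):
    """Truncated importance sampling coefficients of standard Retrace.

    target_probs[t]   : pi(a_t | s_t), probability under the target policy
    behavior_probs[t] : mu(a_t | s_t), probability under the behavior policy
    lam               : trace-decay parameter (assumed name, in [0, 1])

    Illustrative sketch only; this is not the TD-MRetrace update.
    """
    if len(target_probs) != len(behavior_probs):
        raise ValueError("probability sequences must have the same length")

    coefficients = []
    for pi, mu in zip(target_probs, behavior_probs):
        if mu <= 0.0:
            # The behavior policy must be able to take the observed action.
            raise ValueError("behavior probability must be positive")
        # Truncate the ratio at 1 so the coefficient never exceeds lam.
        coefficients.append(lam * min(1.0, pi / mu))
    return coefficients


if __name__ == "__main__":
    # First action is more likely under the target policy (ratio cut to 1),
    # second action is less likely (ratio kept below 1).
    print(retrace_trace_coefficients([0.9, 0.1], [0.5, 0.5]))
```

Truncating at 1 keeps the product of coefficients along a trajectory bounded, which is the usual motivation given for this form of correction.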

Cite this Paper

BibTeX


@InProceedings{pmlr-v216-chen23a,
  title = 	 {Modified Retrace for Off-Policy Temporal Difference Learning},
  author =       {Chen, Xingguo and Ma, Xingzhou and Li, Yang and Yang, Guang and Yang, Shangdong and Gao, Yang},
  booktitle = 	 {Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence},
  pages = 	 {303--312},
  year = 	 {2023},
  editor = 	 {Evans, Robin J. and Shpitser, Ilya},
  volume = 	 {216},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {31 Jul--04 Aug},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v216/chen23a/chen23a.pdf},
  url = 	 {https://proceedings.mlr.press/v216/chen23a.html},
  abstract = 	 {Off-policy learning is key to extending reinforcement learning, as it allows a target policy to be learned from data generated by a different behavior policy. However, combining off-policy learning with bootstrapping and function approximation forms what is well known as “the deadly triad”. Retrace is an efficient off-policy algorithm that employs truncated importance sampling ratios and is convergent with tabular value functions. Unfortunately, Retrace is known to be unstable with linear function approximation. In this paper, we propose modified Retrace to correct the off-policy return, derive a new off-policy temporal difference learning algorithm (TD-MRetrace) with linear function approximation, and obtain a convergence guarantee under standard assumptions. Experimental results on counterexamples and control tasks validate the effectiveness of the proposed algorithm compared with traditional algorithms.}
}

Endnote

%0 Conference Paper
%T Modified Retrace for Off-Policy Temporal Difference Learning
%A Xingguo Chen
%A Xingzhou Ma
%A Yang Li
%A Guang Yang
%A Shangdong Yang
%A Yang Gao
%B Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2023
%E Robin J. Evans
%E Ilya Shpitser
%F pmlr-v216-chen23a
%I PMLR
%P 303--312
%U https://proceedings.mlr.press/v216/chen23a.html
%V 216
%X Off-policy learning is key to extending reinforcement learning, as it allows a target policy to be learned from data generated by a different behavior policy. However, combining off-policy learning with bootstrapping and function approximation forms what is well known as “the deadly triad”. Retrace is an efficient off-policy algorithm that employs truncated importance sampling ratios and is convergent with tabular value functions. Unfortunately, Retrace is known to be unstable with linear function approximation. In this paper, we propose modified Retrace to correct the off-policy return, derive a new off-policy temporal difference learning algorithm (TD-MRetrace) with linear function approximation, and obtain a convergence guarantee under standard assumptions. Experimental results on counterexamples and control tasks validate the effectiveness of the proposed algorithm compared with traditional algorithms.

APA


Chen, X., Ma, X., Li, Y., Yang, G., Yang, S. & Gao, Y. (2023). Modified Retrace for Off-Policy Temporal Difference Learning. Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 216:303-312. Available from https://proceedings.mlr.press/v216/chen23a.html.
