Temporal Difference Learning as Gradient Splitting

Rui Liu; Alex Olshevsky

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:6905-6913, 2021.

Abstract

Temporal difference learning with linear function approximation is a popular method to obtain a low-dimensional approximation of the value function of a policy in a Markov Decision Process. We provide an interpretation of this method in terms of a splitting of the gradient of an appropriately chosen function. As a consequence of this interpretation, convergence proofs for gradient descent can be applied almost verbatim to temporal difference learning. Beyond giving a fuller explanation of why temporal difference works, this interpretation also yields improved convergence times. We consider the setting with $1/\sqrt{T}$ step-size, where previous comparable finite-time convergence time bounds for temporal difference learning had the multiplicative factor $1/(1-\gamma)$ in front of the bound, with $\gamma$ being the discount factor. We show that a minor variation on TD learning which estimates the mean of the value function separately has a convergence time where $1/(1-\gamma)$ only multiplies an asymptotically negligible term.
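The splitting interpretation can be checked numerically on a toy problem. The sketch below is illustrative only and is not the authors' code: the 3-state chain, rewards, features, and all function names are hypothetical, and the abstract does not name the "appropriately chosen function". The sketch assumes it is $f(x) = (1-\gamma)\|\Phi x\|_D^2 + \gamma\|\Phi x\|_{\mathrm{Dir}}^2$, where $D$ weights states by the stationary distribution $\pi$ and $\|V\|_{\mathrm{Dir}}^2 = \frac{1}{2}\sum_{s,s'}\pi(s)P(s,s')(V(s')-V(s))^2$. Under that assumption, the inner product of the negative mean TD direction with a parameter difference should equal $f$ at that difference, which is the same value half the gradient of the quadratic would give.

```python
# Hypothetical 3-state Markov reward process (illustrative values, not from the paper).
GAMMA = 0.9
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]                       # transition matrix
PI = [0.25, 0.5, 0.25]                      # stationary distribution of P
R = [0.0, 0.0, 1.0]                         # expected reward per state
PHI = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # feature vector per state


def value(theta, s):
    """Linear value estimate phi(s) . theta."""
    return sum(f * t for f, t in zip(PHI[s], theta))


def mean_td_direction(theta):
    """Expected TD(0) update direction with states drawn from PI."""
    n = len(P)
    g = [0.0] * len(theta)
    for s in range(n):
        next_value = sum(P[s][s2] * value(theta, s2) for s2 in range(n))
        td_error = R[s] + GAMMA * next_value - value(theta, s)
        for k in range(len(theta)):
            g[k] += PI[s] * td_error * PHI[s][k]
    return g


def assumed_objective(x):
    """Assumed f(x): (1 - gamma) * ||Phi x||_D^2 + gamma * ||Phi x||_Dir^2."""
    n = len(P)
    v = [value(x, s) for s in range(n)]
    weighted_sq = sum(PI[s] * v[s] ** 2 for s in range(n))
    dirichlet = 0.5 * sum(PI[s] * P[s][s2] * (v[s2] - v[s]) ** 2
                          for s in range(n) for s2 in range(n))
    return (1 - GAMMA) * weighted_sq + GAMMA * dirichlet


# The mean TD direction is affine in theta, so taking the difference at two
# arbitrary points cancels the reward term and avoids solving for the fixed point.
theta_a = [1.0, -2.0]
theta_b = [0.5, 3.0]
x = [a - b for a, b in zip(theta_a, theta_b)]
g_a = mean_td_direction(theta_a)
g_b = mean_td_direction(theta_b)

lhs = -sum(xi * (ga - gb) for xi, ga, gb in zip(x, g_a, g_b))
rhs = assumed_objective(x)
print(lhs, rhs)  # the two numbers should agree up to rounding
```

The identity relies on `PI` being stationary for `P`; with a different state-sampling distribution the two quantities would generally differ.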

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-liu21q,
  title = 	 {Temporal Difference Learning as Gradient Splitting},
  author =       {Liu, Rui and Olshevsky, Alex},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {6905--6913},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/liu21q/liu21q.pdf},
  url = 	 {https://proceedings.mlr.press/v139/liu21q.html},
  abstract = 	 {Temporal difference learning with linear function approximation is a popular method to obtain a low-dimensional approximation of the value function of a policy in a Markov Decision Process. We provide an interpretation of this method in terms of a splitting of the gradient of an appropriately chosen function. As a consequence of this interpretation, convergence proofs for gradient descent can be applied almost verbatim to temporal difference learning. Beyond giving a fuller explanation of why temporal difference works, this interpretation also yields improved convergence times. We consider the setting with $1/\sqrt{T}$ step-size, where previous comparable finite-time convergence time bounds for temporal difference learning had the multiplicative factor $1/(1-\gamma)$ in front of the bound, with $\gamma$ being the discount factor. We show that a minor variation on TD learning which estimates the mean of the value function separately has a convergence time where $1/(1-\gamma)$ only multiplies an asymptotically negligible term.}
}

Endnote

%0 Conference Paper
%T Temporal Difference Learning as Gradient Splitting
%A Rui Liu
%A Alex Olshevsky
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-liu21q
%I PMLR
%P 6905--6913
%U https://proceedings.mlr.press/v139/liu21q.html
%V 139
%X Temporal difference learning with linear function approximation is a popular method to obtain a low-dimensional approximation of the value function of a policy in a Markov Decision Process. We provide an interpretation of this method in terms of a splitting of the gradient of an appropriately chosen function. As a consequence of this interpretation, convergence proofs for gradient descent can be applied almost verbatim to temporal difference learning. Beyond giving a fuller explanation of why temporal difference works, this interpretation also yields improved convergence times. We consider the setting with $1/\sqrt{T}$ step-size, where previous comparable finite-time convergence time bounds for temporal difference learning had the multiplicative factor $1/(1-\gamma)$ in front of the bound, with $\gamma$ being the discount factor. We show that a minor variation on TD learning which estimates the mean of the value function separately has a convergence time where $1/(1-\gamma)$ only multiplies an asymptotically negligible term.

APA

Liu, R. & Olshevsky, A. (2021). Temporal Difference Learning as Gradient Splitting. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:6905-6913. Available from https://proceedings.mlr.press/v139/liu21q.html.
