A Temporal-Difference Approach to Policy Gradient Estimation

Samuele Tosatto; Andrew Patterson; Martha White; Rupam Mahmood

Proceedings of the 39th International Conference on Machine Learning, PMLR 162:21609-21632, 2022.

Abstract

The policy gradient theorem (Sutton et al., 2000) prescribes the usage of a cumulative discounted state distribution under the target policy to approximate the gradient. Most algorithms based on this theorem, in practice, break this assumption, introducing a distribution shift that can cause the convergence to poor solutions. In this paper, we propose a new approach of reconstructing the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient calculation in this form can be simplified in terms of a gradient critic, which can be recursively estimated due to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that side-steps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and performance in presence of off-policy samples.

Cite this Paper

BibTeX

@InProceedings{pmlr-v162-tosatto22a,
  title = {A Temporal-Difference Approach to Policy Gradient Estimation},
  author = {Tosatto, Samuele and Patterson, Andrew and White, Martha and Mahmood, Rupam},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages = {21609--21632},
  year = {2022},
  editor = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = {162},
  series = {Proceedings of Machine Learning Research},
  month = {17--23 Jul},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v162/tosatto22a/tosatto22a.pdf},
  url = {https://proceedings.mlr.press/v162/tosatto22a.html},
  abstract = 	 {The policy gradient theorem (Sutton et al., 2000) prescribes the usage of a cumulative discounted state distribution under the target policy to approximate the gradient. Most algorithms based on this theorem, in practice, break this assumption, introducing a distribution shift that can cause the convergence to poor solutions. In this paper, we propose a new approach of reconstructing the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient calculation in this form can be simplified in terms of a gradient critic, which can be recursively estimated due to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that side-steps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and performance in presence of off-policy samples.}
}

Endnote

%0 Conference Paper
%T A Temporal-Difference Approach to Policy Gradient Estimation
%A Samuele Tosatto
%A Andrew Patterson
%A Martha White
%A Rupam Mahmood
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-tosatto22a
%I PMLR
%P 21609--21632
%U https://proceedings.mlr.press/v162/tosatto22a.html
%V 162
%X The policy gradient theorem (Sutton et al., 2000) prescribes the usage of a cumulative discounted state distribution under the target policy to approximate the gradient. Most algorithms based on this theorem, in practice, break this assumption, introducing a distribution shift that can cause the convergence to poor solutions. In this paper, we propose a new approach of reconstructing the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient calculation in this form can be simplified in terms of a gradient critic, which can be recursively estimated due to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that side-steps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and performance in presence of off-policy samples.

APA

Tosatto, S., Patterson, A., White, M. & Mahmood, R. (2022). A Temporal-Difference Approach to Policy Gradient Estimation. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:21609-21632. Available from https://proceedings.mlr.press/v162/tosatto22a.html.
