Average-Reward Off-Policy Policy Evaluation with Function Approximation

Shangtong Zhang; Yi Wan; Richard S Sutton; Shimon Whiteson

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:12578-12588, 2021.

Abstract

We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-zhang21u,
  title     = {Average-Reward Off-Policy Policy Evaluation with Function Approximation},
  author    = {Zhang, Shangtong and Wan, Yi and Sutton, Richard S and Whiteson, Shimon},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {12578--12588},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/zhang21u/zhang21u.pdf},
  url       = {https://proceedings.mlr.press/v139/zhang21u.html},
  abstract  = {We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton \& Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.}
}

Endnote

%0 Conference Paper
%T Average-Reward Off-Policy Policy Evaluation with Function Approximation
%A Shangtong Zhang
%A Yi Wan
%A Richard S Sutton
%A Shimon Whiteson
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-zhang21u
%I PMLR
%P 12578--12588
%U https://proceedings.mlr.press/v139/zhang21u.html
%V 139
%X We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.

APA

Zhang, S., Wan, Y., Sutton, R. S. & Whiteson, S. (2021). Average-Reward Off-Policy Policy Evaluation with Function Approximation. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:12578-12588. Available from https://proceedings.mlr.press/v139/zhang21u.html.
