Average-Reward Off-Policy Policy Evaluation with Function Approximation

Shangtong Zhang, Yi Wan, Richard S Sutton, Shimon Whiteson
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:12578-12588, 2021.

Abstract

We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.
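
For context (not part of the original abstract), a minimal sketch of the quantities the paper targets, using the standard average-reward notation of Sutton & Barto (2018) under the usual ergodicity assumptions; the paper's exact formulation may differ. The reward rate and the differential value function of a policy $\pi$ are
\[
\bar r_\pi \doteq \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[ R_t \mid A_{0:t-1} \sim \pi \big],
\qquad
v_\pi(s) \doteq \mathbb{E}\Big[ \sum_{t=1}^{\infty} \big( R_t - \bar r_\pi \big) \,\Big|\, S_0 = s,\ A_{0:\infty} \sim \pi \Big],
\]
and the differential value function satisfies the average-reward Bellman equation
\[
v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[ r - \bar r_\pi + v_\pi(s') \big],
\]
which is the bootstrapping relation that, combined with off-policy sampling and function approximation, gives rise to the deadly triad discussed above.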

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-zhang21u,
  title     = {Average-Reward Off-Policy Policy Evaluation with Function Approximation},
  author    = {Zhang, Shangtong and Wan, Yi and Sutton, Richard S and Whiteson, Shimon},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {12578--12588},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/zhang21u/zhang21u.pdf},
  url       = {https://proceedings.mlr.press/v139/zhang21u.html},
  abstract  = {We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.}
}
Endnote
%0 Conference Paper
%T Average-Reward Off-Policy Policy Evaluation with Function Approximation
%A Shangtong Zhang
%A Yi Wan
%A Richard S Sutton
%A Shimon Whiteson
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-zhang21u
%I PMLR
%P 12578--12588
%U https://proceedings.mlr.press/v139/zhang21u.html
%V 139
%X We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.
APA
Zhang, S., Wan, Y., Sutton, R.S. & Whiteson, S. (2021). Average-Reward Off-Policy Policy Evaluation with Function Approximation. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:12578-12588. Available from https://proceedings.mlr.press/v139/zhang21u.html.