Policy Evaluation for Variance in Average Reward Reinforcement Learning

Shubhada Agrawal, Prashanth L A, Siva Theja Maguluri
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:471-502, 2024.

Abstract

We consider an average reward reinforcement learning (RL) problem and work with asymptotic variance as a risk measure to model safety-critical applications. We design a temporal-difference (TD) type algorithm tailored for policy evaluation in this context. Our algorithm is based on a linear stochastic approximation of an equivalent formulation of the asymptotic variance in terms of the solution of the Poisson equation. We consider both the tabular and linear function approximation settings, and establish an $\tilde{O}(1/k)$ finite-time convergence rate, where $k$ is the number of steps of the algorithm. Our work paves the way for developing actor-critic style algorithms for variance-constrained RL. To the best of our knowledge, our result provides the first sequential estimator for the asymptotic variance of a Markov chain with provable finite-sample guarantees, which is of independent interest.
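For context, the quantities the abstract refers to can be sketched with standard Markov-chain definitions (these displays are textbook background, not a reproduction of the paper's derivation). For a chain $\{X_k\}$ with transition kernel $P$, stationary distribution $\mu$, reward $r$, and average reward $\bar{r} = \mathbb{E}_{\mu}[r(X_0)]$, the asymptotic variance and the Poisson equation are

$$\sigma^2 = \lim_{n \to \infty} \frac{1}{n}\, \mathrm{Var}\Big(\sum_{k=0}^{n-1} r(X_k)\Big), \qquad V - PV = r - \bar{r}\mathbf{1},$$

and a standard identity rewrites $\sigma^2$ through the Poisson solution $V$:

$$\sigma^2 = \mathbb{E}_{\mu}\big[V(X_0)^2 - (PV)(X_0)^2\big] = \mathbb{E}_{\mu}\big[\big(r(X_0) - \bar{r} + V(X_1) - V(X_0)\big)^2\big].$$

An equivalent formulation of this kind is what makes a TD-type linear stochastic approximation natural: the right-hand side is the expectation of a squared TD error, which can be averaged along a single trajectory.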
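A minimal sequential sketch in this spirit (an illustration of the TD idea above, not the authors' algorithm; the env_step callback, the tabular state space, and the 1/k step sizes are all assumptions):

import numpy as np

def estimate_asymptotic_variance(env_step, s0, n_states, n_steps, seed=0):
    # Illustrative sketch only (not the paper's algorithm): run
    # average-reward TD(0) to track the average reward rbar and a
    # differential value function V, and average the squared TD errors,
    # motivated by sigma^2 = E_mu[(r - rbar + V(s') - V(s))^2].
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)    # differential values (defined up to a constant)
    rbar, var = 0.0, 0.0      # average-reward and variance estimates
    s = s0
    for k in range(1, n_steps + 1):
        r, s_next = env_step(s, rng)          # hypothetical sampler: one transition
        delta = r - rbar + V[s_next] - V[s]   # average-reward TD error
        alpha = 1.0 / k                       # diminishing step size
        rbar += alpha * (r - rbar)
        V[s] += alpha * delta
        var += alpha * (delta ** 2 - var)     # running mean of squared TD errors
        s = s_next
    return rbar, var

On an ergodic chain under a fixed policy, rbar tracks the average reward and var tracks the average squared TD error; the paper's contribution is a provably convergent scheme of this flavor, with an $\tilde{O}(1/k)$ finite-time guarantee that also covers the linear function approximation setting.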

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-agrawal24a,
  title     = {Policy Evaluation for Variance in Average Reward Reinforcement Learning},
  author    = {Agrawal, Shubhada and A, Prashanth L and Maguluri, Siva Theja},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {471--502},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/agrawal24a/agrawal24a.pdf},
  url       = {https://proceedings.mlr.press/v235/agrawal24a.html}
}
Endnote
%0 Conference Paper
%T Policy Evaluation for Variance in Average Reward Reinforcement Learning
%A Shubhada Agrawal
%A Prashanth L A
%A Siva Theja Maguluri
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-agrawal24a
%I PMLR
%P 471--502
%U https://proceedings.mlr.press/v235/agrawal24a.html
%V 235
APA
Agrawal, S., A, P. L., & Maguluri, S. T. (2024). Policy Evaluation for Variance in Average Reward Reinforcement Learning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:471-502. Available from https://proceedings.mlr.press/v235/agrawal24a.html.

Related Material

Download PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/agrawal24a/agrawal24a.pdf