Learning and Planning in Average-Reward Markov Decision Processes

Yi Wan; Abhishek Naik; Richard S Sutton

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:10653-10662, 2021.

Abstract

We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset. All of our algorithms are based on using the temporal-difference error rather than the conventional error when updating the estimate of the average reward. Our proof techniques are a slight generalization of those by Abounadi, Bertsekas, and Borkar (2001). In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms are significantly easier to use.
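
The central idea in the abstract, updating the average-reward estimate with the temporal-difference error instead of the conventional error (reward minus current estimate), can be illustrated with a minimal tabular sketch. This is an illustration under stated assumptions, not the authors' reference implementation: the function name `differential_q_step`, the array layout of `q`, and the parameter names `alpha` (value step size) and `eta` (relative step size for the average-reward estimate) are all hypothetical choices made for this example.

```python
import numpy as np


def differential_q_step(q, r_bar, s, a, r, s_next, alpha=0.1, eta=1.0):
    """Apply one tabular off-policy control update in the average-reward setting.

    q      : 2-D array of action values, indexed as q[state, action] (updated in place)
    r_bar  : current scalar estimate of the average reward
    Returns the updated average-reward estimate and the TD error.
    All names here are illustrative assumptions.
    """
    # TD error: the reward is measured relative to the average-reward estimate.
    delta = r - r_bar + np.max(q[s_next]) - q[s, a]

    # Action-value update driven by the TD error.
    q[s, a] += alpha * delta

    # Average-reward update driven by the same TD error,
    # rather than by the conventional error (r - r_bar).
    r_bar = r_bar + eta * alpha * delta
    return r_bar, delta


# Usage: a single update from an all-zero table.
q = np.zeros((2, 2))
r_bar, delta = differential_q_step(q, 0.0, s=0, a=1, r=1.0, s_next=1, alpha=0.5)
```

Because the same error signal drives both estimates, no reference state is needed to pin down the average reward in this sketch.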

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-wan21a,
  title = 	 {Learning and Planning in Average-Reward Markov Decision Processes},
  author =       {Wan, Yi and Naik, Abhishek and Sutton, Richard S},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {10653--10662},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/wan21a/wan21a.pdf},
  url = 	 {https://proceedings.mlr.press/v139/wan21a.html},
  abstract = 	 {We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset. All of our algorithms are based on using the temporal-difference error rather than the conventional error when updating the estimate of the average reward. Our proof techniques are a slight generalization of those by Abounadi, Bertsekas, and Borkar (2001). In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms are significantly easier to use.}
}

Endnote

%0 Conference Paper
%T Learning and Planning in Average-Reward Markov Decision Processes
%A Yi Wan
%A Abhishek Naik
%A Richard S Sutton
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-wan21a
%I PMLR
%P 10653--10662
%U https://proceedings.mlr.press/v139/wan21a.html
%V 139
%X We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset. All of our algorithms are based on using the temporal-difference error rather than the conventional error when updating the estimate of the average reward. Our proof techniques are a slight generalization of those by Abounadi, Bertsekas, and Borkar (2001). In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms are significantly easier to use.

APA

Wan, Y., Naik, A. & Sutton, R.S. (2021). Learning and Planning in Average-Reward Markov Decision Processes. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:10653-10662. Available from https://proceedings.mlr.press/v139/wan21a.html.
