Learning Infinite-horizon Average-reward Markov Decision Process with Constraints
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:3246-3270, 2022.
Abstract
We study regret minimization for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints. We first design a policy optimization algorithm with a carefully constructed action-value estimator and bonus term, and show that for ergodic MDPs it ensures O(√T) regret and constant constraint violation, where T is the total number of time steps. This strictly improves over the algorithm of Singh et al. (2020), whose regret and constraint violation are both O(T^{2/3}). Next, we consider the most general class of weakly communicating MDPs. Through a finite-horizon approximation, we develop another algorithm with O(T^{2/3}) regret and constraint violation, which can be further improved to O(√T) via a simple modification, albeit at the cost of making the algorithm computationally inefficient. To the best of our knowledge, these are the first algorithms with provable guarantees for weakly communicating MDPs with cost constraints.
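To give a rough sense of the "optimism plus dual variable" pattern that constrained-MDP algorithms of this kind typically follow, here is a minimal, hypothetical Python sketch. The environment, the running-average estimators, the count-based bonus, the softmax policy update, and names such as `cost_budget`, `eta`, and `lam_lr` are all illustrative placeholders chosen for this sketch; they are not the estimator, bonus term, or analysis from the paper.

```python
# Hypothetical sketch: optimistic policy optimization for a constrained MDP.
# The bonus form, step sizes, and dual update are illustrative placeholders,
# not the algorithm analyzed in the paper.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
T = 5000
eta, lam_lr, cost_budget = 0.1, 0.05, 0.3   # assumed tuning parameters

# Unknown environment: random transition kernel, rewards and costs in [0, 1].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))
C = rng.uniform(size=(n_states, n_actions))

counts = np.ones((n_states, n_actions))      # visitation counts (drive the bonus)
q_hat = np.zeros((n_states, n_actions))      # empirical action-value estimate
c_hat = np.zeros((n_states, n_actions))      # empirical cost estimate
policy = np.full((n_states, n_actions), 1.0 / n_actions)
lam = 0.0                                    # dual variable for the cost constraint

s = 0
for t in range(T):
    a = rng.choice(n_actions, p=policy[s])
    r, c = R[s, a], C[s, a]
    s_next = rng.choice(n_states, p=P[s, a])

    # Running averages as crude reward / cost estimators.
    counts[s, a] += 1
    q_hat[s, a] += (r - q_hat[s, a]) / counts[s, a]
    c_hat[s, a] += (c - c_hat[s, a]) / counts[s, a]

    # Exploration bonus shrinking with visitation counts (UCB-style placeholder).
    bonus = np.sqrt(np.log(t + 2) / counts)

    # Softmax (mirror-descent style) policy update on reward minus
    # lambda-weighted cost, made optimistic by the bonus.
    adv = q_hat + bonus - lam * c_hat
    policy = policy * np.exp(eta * adv)
    policy /= policy.sum(axis=1, keepdims=True)

    # Dual ascent on the observed constraint violation.
    lam = max(0.0, lam + lam_lr * (c - cost_budget))
    s = s_next
```

The sketch only illustrates the interplay between an optimistic action-value term and a dual variable that penalizes cost overruns; the regret and constraint-violation guarantees stated in the abstract rely on the paper's specific estimator, bonus design, and (for weakly communicating MDPs) finite-horizon approximation.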