Towards painless policy optimization for constrained MDPs
Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, PMLR 180:895-905, 2022.
Abstract
We study policy optimization in an infinite-horizon, γ-discounted constrained Markov decision process (CMDP). Our objective is to return a policy that achieves a large expected reward with a small constraint violation. We consider the online setting with linear function approximation and assume global access to the corresponding features. We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms in terms of their primal and dual regret on online linear optimization problems. We instantiate this framework with coin-betting algorithms and propose the Coin Betting Politex (CBP) algorithm. Assuming that the action-value functions are ϵ_b-close to the span of the d-dimensional state-action features and that there are no sampling errors, we prove that T iterations of CBP result in an O(1/((1−γ)³√T) + ϵ_b√d/(1−γ)²) reward sub-optimality and an O(1/((1−γ)²√T) + ϵ_b√d/(1−γ)) constraint violation. Importantly, unlike gradient descent-ascent and other recent methods, CBP does not require extensive hyperparameter tuning. Via experiments on synthetic and Cartpole environments, we demonstrate the effectiveness and robustness of CBP.
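For context, primal-dual CMDP methods of this kind are typically derived from the saddle-point form of the constrained objective. The following is a generic sketch rather than notation taken from the paper: writing V_r^π and V_c^π for a policy's expected discounted reward and constraint value, and b for the constraint threshold,

\[
\max_{\pi}\; \min_{\lambda \ge 0}\; \mathcal{L}(\pi,\lambda) \;=\; V_r^{\pi} + \lambda\,\bigl(V_c^{\pi} - b\bigr),
\]

where the primal player updates the policy π to increase L and the dual player updates the multiplier λ to decrease it. In a framework of the kind the abstract describes, the reward sub-optimality and constraint violation of the returned policy are then bounded in terms of the two players' online-learning regrets.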
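The "painless" aspect comes from replacing tuned step sizes with parameter-free coin-betting updates. Below is a minimal Python sketch of a Krichevsky-Trofimov (KT) coin bettor in the style of Orabona and Pál (2016), shown driving a single scalar iterate (e.g., a dual variable). The name kt_bettor, the reward_fn interface, and the omission of a projection onto λ ≥ 0 are illustrative assumptions, not the paper's exact CBP updates.

import numpy as np

def kt_bettor(reward_fn, T, init_wealth=1.0):
    # Parameter-free 1-D online learning via Krichevsky-Trofimov coin betting.
    # reward_fn(x) must return a signal g_t in [-1, 1] for the current bet x,
    # e.g. a normalized constraint violation when x plays the role of a dual
    # variable (an illustrative assumption, not the paper's exact CBP update).
    wealth = init_wealth   # bettor's current wealth
    reward_sum = 0.0       # running sum of past signals g_1, ..., g_{t-1}
    bets = []
    for t in range(1, T + 1):
        beta = reward_sum / t        # KT betting fraction; no step size needed
        x = beta * wealth            # current bet doubles as the iterate
        bets.append(x)
        g = float(np.clip(reward_fn(x), -1.0, 1.0))
        wealth += g * x              # wealth evolves with the signed payoff
        reward_sum += g
    return np.mean(bets)             # averaged iterate (online-to-batch)

Because the betting fraction adapts to the observed signals, there is no learning rate to tune, which is the robustness property the abstract emphasizes relative to gradient descent-ascent.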