A Duality Approach for Regret Minimization in Average-Reward Ergodic Markov Decision Processes
Proceedings of the 2nd Conference on Learning for Dynamics and Control, PMLR 120:862-883, 2020.
Abstract
In light of the Bellman duality, we propose a novel value-policy gradient algorithm to explore and act in infinite-horizon average-reward Markov decision processes (AMDPs) and show that it has sublinear regret. The algorithm is motivated by the Bellman saddle point formulation. It learns the optimal state-action distribution, which encodes a randomized policy, by interacting with the environment along a single trajectory and making primal-dual updates. The key to the analysis is to establish a connection between the min-max duality gap of the Bellman saddle point and the cumulative regret of the learning agent. We show that, for ergodic AMDPs with finite state space $S$, finite action space $A$, and uniformly bounded mixing times, the algorithm's $T$-time-step regret is $R(T)=\tilde{O}\big((t^*_{\mathrm{mix}})^2\,\tau^{3/2}\sqrt{(\tau^3+|A|)\,|S|\,T}\big)$, where $t^*_{\mathrm{mix}}$ is the worst-case mixing time, $\tau$ is an ergodicity parameter, $T$ is the number of time steps, and $\tilde{O}$ hides polylogarithmic factors.
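
The sketch below is a minimal illustration (not the paper's exact algorithm) of the kind of single-trajectory primal-dual update the abstract describes: the primal variable is a state-action occupancy measure that encodes the randomized policy, the dual variable is a differential value function, and both are updated from sampled transitions of the Bellman saddle point Lagrangian. The `env` interface (`reset()`/`step()` returning integer states and scalar rewards), the step sizes, and the exponentiated-gradient projection are illustrative assumptions.

```python
import numpy as np

def primal_dual_amdp(env, num_states, num_actions, T, eta_mu=0.05, eta_v=0.05):
    """Illustrative primal-dual updates for an average-reward MDP (sketch only)."""
    # Primal variable: occupancy measure over state-action pairs (uniform start).
    mu = np.full((num_states, num_actions), 1.0 / (num_states * num_actions))
    # Dual variable: differential value function.
    v = np.zeros(num_states)

    s = env.reset()  # hypothetical interface: returns an integer state
    for _ in range(T):
        # Act with the randomized policy encoded by mu: pi(a|s) proportional to mu(s, a).
        probs = mu[s] / mu[s].sum()
        a = np.random.choice(num_actions, p=probs)
        s_next, r = env.step(a)  # hypothetical interface: (next state, reward)

        # Sampled "advantage" term of the Lagrangian
        # L(mu, v) = sum_{s,a} mu(s,a) * (r(s,a) + E[v(s')] - v(s)).
        adv = r + v[s_next] - v[s]

        # Primal ascent on mu via an exponentiated-gradient (mirror) step,
        # which keeps mu a probability distribution over state-action pairs.
        mu[s, a] *= np.exp(eta_mu * adv)
        mu /= mu.sum()

        # Dual descent on v using the sampled transition: the contribution of
        # the visited pair to the gradient is mu(s,a) * (e_{s'} - e_s).
        v[s_next] -= eta_v * mu[s, a]
        v[s] += eta_v * mu[s, a]

        s = s_next
    return mu, v
```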