Improved Regret Bounds for Thompson Sampling in Linear Quadratic Control Problems
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:1-9, 2018.
Abstract
Thompson sampling (TS) is an effective approach to trade off exploration and exploitation in reinforcement learning. Despite its empirical success and recent advances, its theoretical analysis is often limited to the Bayesian setting, finite state-action spaces, or finite-horizon problems. In this paper, we study an instance of TS in the challenging setting of infinite-horizon linear quadratic (LQ) control, which models problems with continuous state-action variables, linear dynamics, and quadratic cost. In particular, we analyze the regret in the frequentist sense (i.e., for a fixed unknown environment) in one-dimensional systems. We derive the first $O(\sqrt{T})$ frequentist regret bound for this problem, thus significantly improving the $O(T^{2/3})$ bound of Abeille & Lazaric (2017) and matching the frequentist performance derived by Abbasi-Yadkori & Szepesvári (2011) for an optimistic approach and the Bayesian result of Ouyang et al. (2017). We obtain this result by developing a novel bound on the regret due to policy switches, which holds for LQ systems of any dimensionality and allows updating the parameters and the policy at each step, thus overcoming previous limitations due to lazy updates. Finally, we report numerical simulations supporting the conjecture that our result extends to multi-dimensional systems.
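The Thompson sampling scheme the abstract refers to can be summarized as: maintain a regularized least-squares estimate of the unknown dynamics, sample system parameters from a Gaussian centered at that estimate, play the optimal LQ controller for the sampled system, and update both the estimate and the policy at every step. Below is a minimal one-dimensional sketch of this loop; it is not the authors' implementation, and the system values (a=0.8, b=0.5), noise scale, posterior form, and fixed-point Riccati solver are illustrative assumptions:

```python
import numpy as np

def riccati_gain(a, b, q=1.0, r=1.0, iters=200):
    """Solve the scalar discrete-time Riccati equation by fixed-point
    iteration and return the optimal feedback gain k (control u = -k * x)."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)

def ts_lq_control(a_true=0.8, b_true=0.5, T=2000, sigma=0.1, seed=0):
    """Thompson sampling on the 1-d LQ system x' = a*x + b*u + noise.

    The posterior over theta = (a, b) is a Gaussian built from regularized
    least squares; parameters and policy are re-sampled at every step
    (no lazy updates). Returns the final least-squares estimate of (a, b)."""
    rng = np.random.default_rng(seed)
    V = np.eye(2)          # regularized design matrix sum z_t z_t^T + I
    s = np.zeros(2)        # sum of z_t * x_{t+1}
    x = 0.0
    for _ in range(T):
        theta_hat = np.linalg.solve(V, s)
        # Sample dynamics from the approximate posterior N(theta_hat, sigma^2 V^-1).
        cov = sigma ** 2 * np.linalg.inv(V)
        a_s, b_s = rng.multivariate_normal(theta_hat, cov)
        b_s = b_s if abs(b_s) > 1e-2 else 1e-2   # keep sampled system controllable
        k = riccati_gain(a_s, b_s)               # optimal gain for sampled system
        u = -k * x
        x_next = a_true * x + b_true * u + sigma * rng.standard_normal()
        z = np.array([x, u])
        V += np.outer(z, z)
        s += z * x_next
        x = x_next
    return np.linalg.solve(V, s)
```

The gain computation is the standard LQR solution; the exploration comes entirely from the randomness of the sampled parameters, which perturbs the gain at every step and thereby excites the system enough for identification.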