Minimax Regret Bounds for Reinforcement Learning
Proceedings of the 34th International Conference on Machine Learning, PMLR 70:263-272, 2017.
Abstract
We consider the problem of provably optimal exploration in reinforcement learning for finite-horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of Õ(√(HSAT) + H²S²A + H√T), where H is the time horizon, S the number of states, A the number of actions and T the number of time-steps. This result improves over the best previously known bound Õ(HS√(AT)) achieved by the UCRL2 algorithm. The key significance of our new result is that when T ≥ H³S³A and SA ≥ H, it leads to a regret of Õ(√(HSAT)) that matches the established lower bound of Ω(√(HSAT)) up to a logarithmic factor. Our analysis contains two key insights. First, we apply concentration inequalities to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in S); second, we define Bernstein-based “exploration bonuses” that use the empirical variance of the estimated values at the next states (to improve scaling in H).
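To make the second insight concrete, the sketch below illustrates optimistic backward value iteration with a Bernstein-style bonus built from the empirical variance of the next-state values. It is only a schematic rendering of the idea described in the abstract, not the paper's exact algorithm or constants: the function name optimistic_value_iteration, the choice of log factor L, and the specific bonus coefficients are assumptions made for illustration.

```python
import numpy as np

def optimistic_value_iteration(P_hat, R_hat, N, H, delta=0.05):
    """Schematic optimistic value iteration with a Bernstein-style bonus.

    P_hat : (S, A, S) empirical transition probabilities
    R_hat : (S, A) empirical mean rewards in [0, 1]
    N     : (S, A) visit counts (assumed >= 1 for simplicity)
    H     : horizon
    Returns optimistic Q-values of shape (H, S, A).
    """
    S, A, _ = P_hat.shape
    L = np.log(S * A * max(N.max(), 1) / delta)   # illustrative log term, not the paper's constant

    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))                      # V[H] = 0: terminal values

    for h in range(H - 1, -1, -1):
        # Empirical mean and variance of the next-state values under P_hat
        EV = P_hat @ V[h + 1]                            # shape (S, A)
        VarV = P_hat @ (V[h + 1] ** 2) - EV ** 2         # shape (S, A)
        # Bernstein-style bonus: variance-dependent term plus a lower-order 1/N term
        bonus = np.sqrt(2 * L * np.maximum(VarV, 0) / N) + (7 * L * H) / (3 * N)
        # Optimistic backup, clipped at H since values are bounded by the horizon
        Q[h] = np.minimum(R_hat + EV + bonus, H)
        V[h] = Q[h].max(axis=1)                          # greedy values for the next backup
    return Q[:H]
```

Acting greedily with respect to these optimistic Q-values at each episode is what yields the exploration behavior the abstract refers to; the variance term in the bonus is what removes the extra √H factor that a Hoeffding-style bonus would incur.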