Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:7304-7312, 2019.
Abstract
Strong worst-case performance bounds for episodic reinforcement learning exist, but fortunately in practice RL algorithms perform much better than such bounds would predict. Algorithms and theory that provide strong problem-dependent bounds could help illuminate the key features of what makes an RL problem hard, and reduce the barrier to using RL algorithms in practice. As a step towards this, we derive an algorithm and analysis for finite-horizon discrete MDPs with state-of-the-art worst-case regret bounds and substantially tighter bounds if the RL environment has special features, without requiring a priori knowledge of the environment from the algorithm. As a result of our analysis, we also help address an open learning theory question \cite{jiang2018open} about episodic MDPs with a constant upper bound on the sum of rewards, providing a regret bound that depends on the number of episodes with no dependence on the horizon.