The Uncertainty Bellman Equation and Exploration
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:3839-3848, 2018.
Abstract
We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any timestep to the expected value at subsequent timesteps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any timestep to the expected uncertainties at subsequent timesteps, thereby extending the potential exploratory benefit of a policy beyond individual timesteps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for $\epsilon$-greedy improves DQN performance on 51 out of 57 games in the Atari suite.
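
As a rough sketch (with notation assumed here for exposition, not necessarily the paper's), the UBE mirrors the standard Bellman equation: a local uncertainty term $\nu$ plays the role of the mean reward $\mu$, and the uncertainty $u$ plays the role of the value,

$$Q^\pi(s,a) = \mu(s,a) + \gamma \sum_{s',a'} \pi(a' \mid s')\, P(s' \mid s,a)\, Q^\pi(s',a'),$$

$$u^\pi(s,a) = \nu(s,a) + \gamma^2 \sum_{s',a'} \pi(a' \mid s')\, P(s' \mid s,a)\, u^\pi(s',a').$$

Note the $\gamma^2$: since $u$ tracks a variance-like quantity, the discount compounds quadratically. Because variances add along a trajectory while standard deviations do not ($\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$), a bonus of $\sqrt{u}$ taken from the fixed point can be tighter than a bonus that sums per-step standard deviations, which is the comparison the abstract draws with count-based methods.

For the substitution mentioned above, the following is a minimal sketch of one plausible way to act on such an uncertainty estimate in place of $\epsilon$-greedy; the function name and array inputs are assumptions for illustration, not the paper's interface.

import numpy as np

def ube_act(q_values, u_values, rng):
    """Greedy action w.r.t. Q perturbed by its uncertainty.

    Samples a single standard-normal zeta and acts greedily on
    Q + zeta * sqrt(u), so actions with high posterior variance
    get tried more often. q_values and u_values are per-action
    arrays produced elsewhere (e.g., by a value head and an
    uncertainty head); both are assumed inputs.
    """
    zeta = rng.standard_normal()  # one shared noise sample per step
    return int(np.argmax(q_values + zeta * np.sqrt(u_values)))

# Example usage:
#   rng = np.random.default_rng(0)
#   action = ube_act(q, u, rng)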