Bayesian RL for Goal-Only Rewards
Proceedings of The 2nd Conference on Robot Learning, PMLR 87:386-398, 2018.
Abstract
We address the challenging problem of reinforcement learning under goal-only rewards [1], where rewards are only nonzero when the goal is achieved. This reward definition alleviates the need for cumbersome reward engineering, making the reward formulation trivial. Classic exploration heuristics such as Boltzmann or epsilon-greedy exploration are highly inefficient in domains with goal-only rewards. We solve this problem by leveraging value-function posterior variance information to direct exploration where uncertainty is higher. The proposed algorithm (EMU-Q) achieves data-efficient exploration and balances exploration and exploitation explicitly at the policy level, granting users more control over the learning process. We introduce general features approximating kernels, allowing us to greatly reduce the algorithm's complexity from O(N^3) in the number of transitions to O(M^2) in the number of features. We demonstrate that EMU-Q is competitive with other exploration techniques on a variety of continuous control tasks and on a robotic manipulator.