Bayesian RL for Goal-Only Rewards

Philippe Morere, Fabio Ramos
Proceedings of The 2nd Conference on Robot Learning, PMLR 87:386-398, 2018.

Abstract

We address the challenging problem of reinforcement learning under goal-only rewards [1], where rewards are only non-zero when the goal is achieved. This reward definition alleviates the need for cumbersome reward engineering, making the reward formulation trivial. Classic exploration heuristics such as Boltzmann or epsilon-greedy exploration are highly inefficient in domains with goal-only rewards. We solve this problem by leveraging value function posterior variance information to direct exploration where uncertainty is higher. The proposed algorithm (EMU-Q) achieves data-efficient exploration, and balances exploration and exploitation explicitly at a policy level, granting users more control over the learning process. We introduce general features approximating kernels, allowing us to greatly reduce the algorithm complexity from O(N^3) in the number of transitions to O(M^2) in the number of features. We demonstrate EMU-Q is competitive with other exploration techniques on a variety of continuous control tasks and on a robotic manipulator.
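The exploration mechanism described in the abstract can be illustrated with a minimal sketch: random Fourier features approximate an RBF kernel, Bayesian linear regression on those features yields a Q-value posterior whose covariance update scales with the number of features M rather than the number of transitions N, and an explicit trade-off parameter weights posterior variance against posterior mean when selecting actions. This is an illustrative approximation, not the authors' EMU-Q implementation; the names (`phi`, `q_posterior`, `select_action`) and all hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random Fourier features approximating an RBF kernel. Using M features
# keeps posterior updates O(M^2)-sized instead of O(N^3) in the number
# of observed transitions N (illustrative of the paper's claim).
M, d, lengthscale = 100, 3, 1.0  # d = state dim + action dim (assumed 2 + 1)
W = rng.normal(0.0, 1.0 / lengthscale, size=(M, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=M)

def phi(x):
    """Map a state-action vector x (shape (d,)) to M random features."""
    return np.sqrt(2.0 / M) * np.cos(W @ x + b)

# Bayesian linear regression over the features gives a Q-value posterior.
noise, prior = 0.1, 1.0
A = np.eye(M) / prior   # posterior precision matrix (M x M)
y_acc = np.zeros(M)     # accumulated feature-weighted targets

def update(x, target):
    """Rank-one posterior update from one (state-action, return) pair."""
    global A, y_acc
    f = phi(x)
    A += np.outer(f, f) / noise**2
    y_acc += f * target / noise**2

def q_posterior(x):
    """Posterior mean and variance of Q(x); variance directs exploration."""
    f = phi(x)
    cov = np.linalg.inv(A)
    return f @ cov @ y_acc, f @ cov @ f

def select_action(state, candidate_actions, kappa=1.0):
    """UCB-style rule: kappa explicitly trades exploration vs exploitation."""
    def score(a):
        mean, var = q_posterior(np.concatenate([state, a]))
        return mean + kappa * np.sqrt(var)
    return max(candidate_actions, key=score)
```

After repeated updates at a state-action pair, its posterior variance shrinks, so `select_action` is drawn toward regions the agent has not yet visited; setting `kappa` gives the policy-level control over exploration mentioned in the abstract.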

Cite this Paper


BibTeX
@InProceedings{pmlr-v87-morere18a,
  title = {Bayesian RL for Goal-Only Rewards},
  author = {Morere, Philippe and Ramos, Fabio},
  booktitle = {Proceedings of The 2nd Conference on Robot Learning},
  pages = {386--398},
  year = {2018},
  editor = {Aude Billard and Anca Dragan and Jan Peters and Jun Morimoto},
  volume = {87},
  series = {Proceedings of Machine Learning Research},
  month = {29--31 Oct},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v87/morere18a/morere18a.pdf},
  url = {http://proceedings.mlr.press/v87/morere18a.html},
  abstract = {We address the challenging problem of reinforcement learning under goal-only rewards [1], where rewards are only non-zero when the goal is achieved. This reward definition alleviates the need for cumbersome reward engineering, making the reward formulation trivial. Classic exploration heuristics such as Boltzmann or epsilon-greedy exploration are highly inefficient in domains with goal-only rewards. We solve this problem by leveraging value function posterior variance information to direct exploration where uncertainty is higher. The proposed algorithm (EMU-Q) achieves data-efficient exploration, and balances exploration and exploitation explicitly at a policy level granting users more control over the learning process. We introduce general features approximating kernels, allowing to greatly reduce the algorithm complexity from O(N^3) in the number of transitions to O(M^2) in the number of features. We demonstrate EMU-Q is competitive with other exploration techniques on a variety of continuous control tasks and on a robotic manipulator.}
}
Endnote
%0 Conference Paper
%T Bayesian RL for Goal-Only Rewards
%A Philippe Morere
%A Fabio Ramos
%B Proceedings of The 2nd Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Aude Billard
%E Anca Dragan
%E Jan Peters
%E Jun Morimoto
%F pmlr-v87-morere18a
%I PMLR
%J Proceedings of Machine Learning Research
%P 386--398
%U http://proceedings.mlr.press
%V 87
%W PMLR
%X We address the challenging problem of reinforcement learning under goal-only rewards [1], where rewards are only non-zero when the goal is achieved. This reward definition alleviates the need for cumbersome reward engineering, making the reward formulation trivial. Classic exploration heuristics such as Boltzmann or epsilon-greedy exploration are highly inefficient in domains with goal-only rewards. We solve this problem by leveraging value function posterior variance information to direct exploration where uncertainty is higher. The proposed algorithm (EMU-Q) achieves data-efficient exploration, and balances exploration and exploitation explicitly at a policy level granting users more control over the learning process. We introduce general features approximating kernels, allowing to greatly reduce the algorithm complexity from O(N^3) in the number of transitions to O(M^2) in the number of features. We demonstrate EMU-Q is competitive with other exploration techniques on a variety of continuous control tasks and on a robotic manipulator.
APA
Morere, P. & Ramos, F. (2018). Bayesian RL for Goal-Only Rewards. Proceedings of The 2nd Conference on Robot Learning, in PMLR 87:386-398.

Related Material