Learning to Plan via Deep Optimistic Value Exploration

Tim Seyde; Wilko Schwarting; Sertac Karaman; Daniela Rus

Proceedings of the 2nd Conference on Learning for Dynamics and Control, PMLR 120:815-825, 2020.

Abstract

Deep exploration requires coordinated long-term planning. We present a model-based reinforcement learning algorithm that guides policy learning through a value function that exhibits optimism in the face of uncertainty. We capture uncertainty over values by combining predictions from an ensemble of models and formulate an upper confidence bound (UCB) objective to recover optimistic estimates. Training the policy on ensemble rollouts with the learned value function as the terminal cost allows for projecting long-term interactions into a limited planning horizon, thus enabling deep optimistic exploration. We do not assume a priori knowledge of either the dynamics or reward function. We demonstrate that our approach can accommodate both dense and sparse reward signals, while improving sample complexity on a variety of benchmarking tasks.
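
The optimistic value estimate described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a common UCB form, the ensemble mean plus a scaled standard deviation, and the abstract does not state the paper's exact objective. The function names `rollout_value` and `ucb_value`, the parameters `gamma` and `beta`, and all numbers are hypothetical and chosen for illustration only.

```python
from statistics import mean, pstdev


def rollout_value(rewards, terminal_value, gamma=0.99):
    """Discounted reward sum over a short rollout plus a bootstrapped terminal value.

    The terminal value stands in for the learned value function, which
    summarizes returns beyond the limited planning horizon.
    """
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total + (gamma ** len(rewards)) * terminal_value


def ucb_value(estimates, beta=1.0):
    """Optimistic estimate: ensemble mean plus beta times the ensemble spread.

    Assumption: mean + beta * std is one common UCB form, not necessarily
    the objective used in the paper.
    """
    return mean(estimates) + beta * pstdev(estimates)


# Hypothetical 3-step rollouts from three ensemble members: (rewards, terminal value).
ensemble_rollouts = [
    ([1.0, 0.0, 1.0], 5.0),
    ([1.0, 1.0, 1.0], 4.0),
    ([0.0, 0.0, 1.0], 6.0),
]

estimates = [rollout_value(r, v, gamma=0.9) for r, v in ensemble_rollouts]
optimistic = ucb_value(estimates, beta=1.0)
print(optimistic)
```

Under this assumed form, a larger `beta` weights disagreement between ensemble members more heavily, so states where the models disagree look more attractive to the policy.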

Cite this Paper

BibTeX


@InProceedings{pmlr-v120-seyde20a,
  title = 	 {Learning to Plan via Deep Optimistic Value Exploration},
  author =       {Seyde, Tim and Schwarting, Wilko and Karaman, Sertac and Rus, Daniela},
  booktitle = 	 {Proceedings of the 2nd Conference on Learning for Dynamics and Control},
  pages = 	 {815--825},
  year = 	 {2020},
  editor = 	 {Bayen, Alexandre M. and Jadbabaie, Ali and Pappas, George and Parrilo, Pablo A. and Recht, Benjamin and Tomlin, Claire and Zeilinger, Melanie},
  volume = 	 {120},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {10--11 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v120/seyde20a/seyde20a.pdf},
  url = 	 {https://proceedings.mlr.press/v120/seyde20a.html},
  abstract = 	 {Deep exploration requires coordinated long-term planning. We present a model-based reinforcement learning algorithm that guides policy learning through a value function that exhibits optimism in the face of uncertainty. We capture uncertainty over values by combining predictions from an ensemble of models and formulate an upper confidence bound (UCB) objective to recover optimistic estimates. Training the policy on ensemble rollouts with the learned value function as the terminal cost allows for projecting long-term interactions into a limited planning horizon, thus enabling deep optimistic exploration. We do not assume a priori knowledge of either the dynamics or reward function. We demonstrate that our approach can accommodate both dense and sparse reward signals, while improving sample complexity on a variety of benchmarking tasks.}
}

Endnote

%0 Conference Paper
%T Learning to Plan via Deep Optimistic Value Exploration
%A Tim Seyde
%A Wilko Schwarting
%A Sertac Karaman
%A Daniela Rus
%B Proceedings of the 2nd Conference on Learning for Dynamics and Control
%C Proceedings of Machine Learning Research
%D 2020
%E Alexandre M. Bayen
%E Ali Jadbabaie
%E George Pappas
%E Pablo A. Parrilo
%E Benjamin Recht
%E Claire Tomlin
%E Melanie Zeilinger
%F pmlr-v120-seyde20a
%I PMLR
%P 815--825
%U https://proceedings.mlr.press/v120/seyde20a.html
%V 120
%X Deep exploration requires coordinated long-term planning. We present a model-based reinforcement learning algorithm that guides policy learning through a value function that exhibits optimism in the face of uncertainty. We capture uncertainty over values by combining predictions from an ensemble of models and formulate an upper confidence bound (UCB) objective to recover optimistic estimates. Training the policy on ensemble rollouts with the learned value function as the terminal cost allows for projecting long-term interactions into a limited planning horizon, thus enabling deep optimistic exploration. We do not assume a priori knowledge of either the dynamics or reward function. We demonstrate that our approach can accommodate both dense and sparse reward signals, while improving sample complexity on a variety of benchmarking tasks.

APA


Seyde, T., Schwarting, W., Karaman, S. & Rus, D. (2020). Learning to Plan via Deep Optimistic Value Exploration. Proceedings of the 2nd Conference on Learning for Dynamics and Control, in Proceedings of Machine Learning Research 120:815-825. Available from https://proceedings.mlr.press/v120/seyde20a.html.
