Scale-free adaptive planning for deterministic dynamics & discounted rewards

Peter Bartlett; Victor Gabillon; Jennifer Healey; Michal Valko

Proceedings of the 36th International Conference on Machine Learning, PMLR 97:495-504, 2019.

Abstract

We address the problem of planning in an environment with deterministic dynamics and stochastic discounted rewards under a limited numerical budget where the ranges of both rewards and noise are unknown. We introduce PlaTypOOS, an adaptive, robust, and efficient alternative to the OLOP (open-loop optimistic planning) algorithm. Whereas OLOP requires a priori knowledge of the ranges of both rewards and noise, PlaTypOOS dynamically adapts its behavior to both. This allows PlaTypOOS to be immune to two vulnerabilities of OLOP: failure when given underestimated ranges of noise and rewards, and inefficiency when these are overestimated. PlaTypOOS additionally adapts to the global smoothness of the value function. We prove that PlaTypOOS is more efficient than OLOP when OLOP is given an overestimated reward range, and we show that, in the case of no noise, PlaTypOOS learns exponentially faster.

Cite this Paper

BibTeX

@InProceedings{pmlr-v97-bartlett19a,
  title = 	 {Scale-free adaptive planning for deterministic dynamics \& discounted rewards},
  author =       {Bartlett, Peter and Gabillon, Victor and Healey, Jennifer and Valko, Michal},
  booktitle = 	 {Proceedings of the 36th International Conference on Machine Learning},
  pages = 	 {495--504},
  year = 	 {2019},
  editor = 	 {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume = 	 {97},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {09--15 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v97/bartlett19a/bartlett19a.pdf},
  url = 	 {https://proceedings.mlr.press/v97/bartlett19a.html},
  abstract = 	 {We address the problem of planning in an environment with deterministic dynamics and stochastic discounted rewards under a limited numerical budget where the ranges of both rewards and noise are unknown. We introduce PlaTypOOS, an adaptive, robust, and efficient alternative to the OLOP (open-loop optimistic planning) algorithm. Whereas OLOP requires a priori knowledge of the ranges of both rewards and noise, PlaTypOOS dynamically adapts its behavior to both. This allows PlaTypOOS to be immune to two vulnerabilities of OLOP: failure when given underestimated ranges of noise and rewards, and inefficiency when these are overestimated. PlaTypOOS additionally adapts to the global smoothness of the value function. We prove that PlaTypOOS is more efficient than OLOP when OLOP is given an overestimated reward range, and we show that, in the case of no noise, PlaTypOOS learns exponentially faster.}
}

Endnote

%0 Conference Paper
%T Scale-free adaptive planning for deterministic dynamics & discounted rewards
%A Peter Bartlett
%A Victor Gabillon
%A Jennifer Healey
%A Michal Valko
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-bartlett19a
%I PMLR
%P 495--504
%U https://proceedings.mlr.press/v97/bartlett19a.html
%V 97
%X We address the problem of planning in an environment with deterministic dynamics and stochastic discounted rewards under a limited numerical budget where the ranges of both rewards and noise are unknown. We introduce PlaTypOOS, an adaptive, robust, and efficient alternative to the OLOP (open-loop optimistic planning) algorithm. Whereas OLOP requires a priori knowledge of the ranges of both rewards and noise, PlaTypOOS dynamically adapts its behavior to both. This allows PlaTypOOS to be immune to two vulnerabilities of OLOP: failure when given underestimated ranges of noise and rewards, and inefficiency when these are overestimated. PlaTypOOS additionally adapts to the global smoothness of the value function. We prove that PlaTypOOS is more efficient than OLOP when OLOP is given an overestimated reward range, and we show that, in the case of no noise, PlaTypOOS learns exponentially faster.

APA

Bartlett, P., Gabillon, V., Healey, J. & Valko, M. (2019). Scale-free adaptive planning for deterministic dynamics & discounted rewards. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:495-504. Available from https://proceedings.mlr.press/v97/bartlett19a.html.
