High Confidence Policy Improvement

Philip Thomas; Georgios Theocharous; Mohammad Ghavamzadeh


Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:2380-2388, 2015.

Abstract

We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameter that requires expert tuning. Specifically, the user may select any performance lower bound and confidence level, and our algorithm will ensure that the probability that it returns a policy with performance below the lower bound is at most the specified confidence level. We then propose an incremental algorithm that executes our policy improvement algorithm repeatedly to generate multiple policy improvements. We show the viability of our approach with a simple 4 x 4 gridworld and the standard mountain car problem, as well as with a digital marketing application that uses real-world data.
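The guarantee described above can be illustrated with a small Python sketch. It is not the authors' implementation. The function names (`iw_returns`, `lower_bound`, `hcpi`), the 50/50 data split, the clipping constant `c`, and the use of Hoeffding's inequality as the confidence bound are all hypothetical choices made for brevity. The sketch assumes nonnegative returns, trajectories drawn independently from a behavior policy, and logged behavior-policy action probabilities that are nonzero for every action taken. Candidate policies are scored on one part of the batch, and only the best-scoring one is put through a safety test on the other part. It is returned only if a 1 - delta lower bound on its importance-sampled performance is at least the user's threshold `rho_minus`.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical data layout: ([(state, action, pi_b(action | state)), ...], return G).
# Assumption of this sketch: returns are normalized so that G >= 0.
Trajectory = Tuple[List[Tuple[int, int, float]], float]
Policy = Callable[[int, int], float]  # policy(state, action) -> probability


def iw_returns(policy: Policy, data: Sequence[Trajectory]) -> List[float]:
    """Ordinary per-trajectory importance sampling estimates of the policy's return."""
    out = []
    for steps, g in data:
        w = 1.0
        for s, a, p_b in steps:
            w *= policy(s, a) / p_b  # product of per-step likelihood ratios
        out.append(w * g)
    return out


def lower_bound(xs: Sequence[float], c: float, delta: float) -> float:
    """1 - delta lower bound on the mean of nonnegative xs via Hoeffding's inequality.

    Clipping at c can only lower each nonnegative sample, so a lower bound on the
    clipped mean is still a lower bound on the unclipped mean.
    """
    n = len(xs)
    mean = sum(min(x, c) for x in xs) / n
    return mean - c * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def hcpi(candidates: Sequence[Policy], data: Sequence[Trajectory],
         rho_minus: float, delta: float, c: float,
         split: float = 0.5) -> Optional[Policy]:
    """Return a policy whose performance is >= rho_minus w.p. >= 1 - delta, or None."""
    k = int(len(data) * split)
    train, test = data[:k], data[k:]
    # Candidate selection sees only the training split.
    best = max(candidates, key=lambda pi: lower_bound(iw_returns(pi, train), c, delta))
    # Safety test on held-out data that played no part in the selection.
    if lower_bound(iw_returns(best, test), c, delta) >= rho_minus:
        return best
    return None  # no policy passed the safety test


# Hypothetical one-step example: the behavior policy picks action 0 or 1 with
# probability 0.5 each; action 1 yields return 1 and action 0 yields return 0.
def pi_uniform(s: int, a: int) -> float:
    return 0.5


def pi_good(s: int, a: int) -> float:
    return 0.9 if a == 1 else 0.1


data = [([(0, i % 2, 0.5)], float(i % 2)) for i in range(2000)]
chosen = hcpi([pi_uniform, pi_good], data, rho_minus=0.5, delta=0.05, c=2.0)
print(chosen.__name__ if chosen else None)  # prints pi_good
print(hcpi([pi_uniform, pi_good], data, rho_minus=0.85, delta=0.05, c=2.0))  # prints None
```

Because the held-out data is not used to pick the candidate, Hoeffding's inequality applies to it directly. A policy whose true performance is below `rho_minus` can therefore be returned only if the bound overestimates it, which under the stated assumptions happens with probability at most `delta`. In the example, `pi_good` has a lower bound of roughly 0.823 on the held-out half, so it passes a threshold of 0.5 but not 0.85. Calling `hcpi` again as new batches arrive would be one hypothetical way to mimic the incremental variant mentioned in the abstract.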

Cite this Paper

BibTeX


@InProceedings{pmlr-v37-thomas15,
  title = 	 {High Confidence Policy Improvement},
  author = 	 {Thomas, Philip and Theocharous, Georgios and Ghavamzadeh, Mohammad},
  booktitle = 	 {Proceedings of the 32nd International Conference on Machine Learning},
  pages = 	 {2380--2388},
  year = 	 {2015},
  editor = 	 {Bach, Francis and Blei, David},
  volume = 	 {37},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Lille, France},
  month = 	 {07--09 Jul},
  publisher = 	 {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v37/thomas15.pdf},
  url = 	 {https://proceedings.mlr.press/v37/thomas15.html},
  abstract = 	 {We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameter that requires expert tuning. Specifically, the user may select any performance lower-bound and confidence level and our algorithm will ensure that the probability that it returns a policy with performance below the lower bound is at most the specified confidence level. We then propose an incremental algorithm that executes our policy improvement algorithm repeatedly to generate multiple policy improvements. We show the viability of our approach with a simple 4 x 4 gridworld and the standard mountain car problem, as well as with a digital marketing application that uses real world data.}
}

Endnote

%0 Conference Paper
%T High Confidence Policy Improvement
%A Philip Thomas
%A Georgios Theocharous
%A Mohammad Ghavamzadeh
%B Proceedings of the 32nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2015
%E Francis Bach
%E David Blei
%F pmlr-v37-thomas15
%I PMLR
%P 2380--2388
%U https://proceedings.mlr.press/v37/thomas15.html
%V 37
%X We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameter that requires expert tuning. Specifically, the user may select any performance lower-bound and confidence level and our algorithm will ensure that the probability that it returns a policy with performance below the lower bound is at most the specified confidence level. We then propose an incremental algorithm that executes our policy improvement algorithm repeatedly to generate multiple policy improvements. We show the viability of our approach with a simple 4 x 4 gridworld and the standard mountain car problem, as well as with a digital marketing application that uses real world data.

RIS


TY  - CPAPER
TI  - High Confidence Policy Improvement
AU  - Philip Thomas
AU  - Georgios Theocharous
AU  - Mohammad Ghavamzadeh
BT  - Proceedings of the 32nd International Conference on Machine Learning
DA  - 2015/06/01
ED  - Francis Bach
ED  - David Blei
ID  - pmlr-v37-thomas15
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 37
SP  - 2380
EP  - 2388
L1  - http://proceedings.mlr.press/v37/thomas15.pdf
UR  - https://proceedings.mlr.press/v37/thomas15.html
AB  - We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameter that requires expert tuning. Specifically, the user may select any performance lower-bound and confidence level and our algorithm will ensure that the probability that it returns a policy with performance below the lower bound is at most the specified confidence level. We then propose an incremental algorithm that executes our policy improvement algorithm repeatedly to generate multiple policy improvements. We show the viability of our approach with a simple 4 x 4 gridworld and the standard mountain car problem, as well as with a digital marketing application that uses real world data.
ER  -

APA


Thomas, P., Theocharous, G., & Ghavamzadeh, M. (2015). High Confidence Policy Improvement. Proceedings of the 32nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 37:2380-2388. Available from https://proceedings.mlr.press/v37/thomas15.html.
