High Confidence Policy Improvement

Philip Thomas, Georgios Theocharous, Mohammad Ghavamzadeh
Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:2380-2388, 2015.

Abstract

We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameter that requires expert tuning. Specifically, the user may select any performance lower bound and confidence level, and our algorithm will ensure that the probability that it returns a policy with performance below the lower bound is at most the specified confidence level. We then propose an incremental algorithm that executes our policy improvement algorithm repeatedly to generate multiple policy improvements. We show the viability of our approach with a simple 4×4 gridworld and the standard mountain car problem, as well as with a digital marketing application that uses real-world data.
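The guarantee described above has a simple operational shape: estimate the candidate policy's performance off-policy from the batch of trajectories, compute a (1 − δ)-confidence lower bound on that estimate with a concentration inequality, and return the candidate only if the bound clears the user's performance threshold (otherwise declare that no solution was found). The sketch below illustrates that shape only; it uses a clipped importance-sampling estimator with a Hoeffding bound, whereas the paper develops its own estimators and tighter bounds, and every name here (Trajectory, safety_test, candidate_prob, and so on) is illustrative rather than the paper's API.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Trajectory:
    """One logged trajectory, with the behavior policy's action
    probabilities pi_b(a_t | s_t) recorded at collection time."""
    states: Sequence[object]
    actions: Sequence[object]
    rewards: Sequence[float]
    behavior_probs: Sequence[float]

def clipped_is_estimate(traj: Trajectory,
                        candidate_prob: Callable[[object, object], float],
                        clip: float, r_min: float, r_max: float) -> float:
    """Clipped per-trajectory importance-sampling estimate of the
    candidate policy's return, after normalizing total returns into
    [0, 1] using bounds r_min/r_max on the total return. Clipping the
    weight can only lower the estimate, so a lower bound on its mean
    remains a valid (conservative) lower bound on true performance."""
    weight = 1.0
    for s, a, pb in zip(traj.states, traj.actions, traj.behavior_probs):
        weight *= candidate_prob(s, a) / pb
    weight = min(weight, clip)
    g_norm = (sum(traj.rewards) - r_min) / (r_max - r_min)
    return weight * g_norm

def safety_test(trajs: Sequence[Trajectory],
                candidate_prob: Callable[[object, object], float],
                rho_minus: float, delta: float,
                clip: float = 10.0,
                r_min: float = 0.0, r_max: float = 1.0) -> bool:
    """Accept the candidate policy only if a (1 - delta)-confidence
    Hoeffding lower bound on its normalized performance is at least
    rho_minus (given on the same normalized [0, 1] scale)."""
    n = len(trajs)
    estimates = [clipped_is_estimate(t, candidate_prob, clip, r_min, r_max)
                 for t in trajs]
    mean = sum(estimates) / n
    # Each estimate lies in [0, clip], so Hoeffding's inequality gives
    # P(mean - E[mean] <= -t) <= exp(-2 n t^2 / clip^2); solving for t
    # at failure probability delta yields the margin below.
    lower = mean - clip * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return lower >= rho_minus
```

In the incremental variant mentioned in the abstract, a test of this kind is run repeatedly so that a sequence of policy improvements is produced, each individually carrying the guarantee. One caveat worth noting: if candidates are searched over using the same data the test consumes, the guarantee degrades through multiple comparisons, which is why methods of this kind typically certify a proposed candidate on held-out trajectories.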

Cite this Paper


BibTeX
@InProceedings{pmlr-v37-thomas15,
  title     = {High Confidence Policy Improvement},
  author    = {Thomas, Philip and Theocharous, Georgios and Ghavamzadeh, Mohammad},
  booktitle = {Proceedings of the 32nd International Conference on Machine Learning},
  pages     = {2380--2388},
  year      = {2015},
  editor    = {Bach, Francis and Blei, David},
  volume    = {37},
  series    = {Proceedings of Machine Learning Research},
  address   = {Lille, France},
  month     = {07--09 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v37/thomas15.pdf},
  url       = {https://proceedings.mlr.press/v37/thomas15.html}
}
APA
Thomas, P., Theocharous, G. & Ghavamzadeh, M. (2015). High Confidence Policy Improvement. Proceedings of the 32nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 37:2380-2388. Available from https://proceedings.mlr.press/v37/thomas15.html.

Related Material

PDF: http://proceedings.mlr.press/v37/thomas15.pdf