Policy Certificates: Towards Accountable Reinforcement Learning

Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:1507-1516, 2019.

Abstract

The performance of a reinforcement learning algorithm can vary drastically during learning because of exploration. Existing algorithms provide little information about the quality of their current policy before executing it, and thus have limited use in high-stakes applications like healthcare. We address this lack of accountability by proposing that algorithms output policy certificates. These certificates bound the sub-optimality and return of the policy in the next episode, allowing humans to intervene when the certified quality is not satisfactory. We further introduce two new algorithms with certificates and present a new framework for theoretical analysis that guarantees the quality of their policies and certificates. For tabular MDPs, we show that computing certificates can even improve the sample-efficiency of optimism-based exploration. As a result, one of our algorithms is the first to achieve minimax-optimal PAC bounds up to lower-order terms, and this algorithm also matches (and in some settings slightly improves upon) existing minimax regret bounds.
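The abstract describes the core interaction pattern: before each episode the learner announces a certificate bounding the return and sub-optimality of the policy it is about to execute, and a human can intervene when that certified quality is unsatisfactory. Below is a minimal sketch of that loop; the agent/environment interface (plan_next_episode, run_episode, fallback_policy) and the exact contents of the certificate are illustrative assumptions, not the paper's actual algorithms or API.

```python
# Hypothetical sketch of an accountable RL loop with policy certificates.
# The certificate fields and method names below are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Certificate:
    return_lower_bound: float          # certified lower bound on the next episode's return
    suboptimality_upper_bound: float   # certified upper bound on the gap to the optimal return


def certified_rl_loop(agent, env, num_episodes: int,
                      max_acceptable_gap: float,
                      fallback_policy: Callable):
    """Run episodes, deferring to a safe baseline whenever the certificate
    reports an unacceptably large suboptimality bound."""
    for _ in range(num_episodes):
        # Assumed interface: the agent returns its next-episode policy
        # together with a certificate on that policy's quality.
        policy, cert = agent.plan_next_episode()

        if cert.suboptimality_upper_bound > max_acceptable_gap:
            # Certified quality is not satisfactory: a human (or supervisor)
            # intervenes and a safe fallback policy is executed instead.
            trajectory = env.run_episode(fallback_policy)
        else:
            trajectory = env.run_episode(policy)

        # The agent still learns from whatever data was collected.
        agent.update(trajectory)
```

The design point this illustrates is that the certificate is announced before execution, so the decision to intervene can be made without observing the (possibly poor) exploratory behavior first.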

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-dann19a,
  title     = {Policy Certificates: Towards Accountable Reinforcement Learning},
  author    = {Dann, Christoph and Li, Lihong and Wei, Wei and Brunskill, Emma},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {1507--1516},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/dann19a/dann19a.pdf},
  url       = {https://proceedings.mlr.press/v97/dann19a.html},
  abstract  = {The performance of a reinforcement learning algorithm can vary drastically during learning because of exploration. Existing algorithms provide little information about the quality of their current policy before executing it, and thus have limited use in high-stakes applications like healthcare. We address this lack of accountability by proposing that algorithms output policy certificates. These certificates bound the sub-optimality and return of the policy in the next episode, allowing humans to intervene when the certified quality is not satisfactory. We further introduce two new algorithms with certificates and present a new framework for theoretical analysis that guarantees the quality of their policies and certificates. For tabular MDPs, we show that computing certificates can even improve the sample-efficiency of optimism-based exploration. As a result, one of our algorithms is the first to achieve minimax-optimal PAC bounds up to lower-order terms, and this algorithm also matches (and in some settings slightly improves upon) existing minimax regret bounds.}
}
Endnote
%0 Conference Paper
%T Policy Certificates: Towards Accountable Reinforcement Learning
%A Christoph Dann
%A Lihong Li
%A Wei Wei
%A Emma Brunskill
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-dann19a
%I PMLR
%P 1507--1516
%U https://proceedings.mlr.press/v97/dann19a.html
%V 97
%X The performance of a reinforcement learning algorithm can vary drastically during learning because of exploration. Existing algorithms provide little information about the quality of their current policy before executing it, and thus have limited use in high-stakes applications like healthcare. We address this lack of accountability by proposing that algorithms output policy certificates. These certificates bound the sub-optimality and return of the policy in the next episode, allowing humans to intervene when the certified quality is not satisfactory. We further introduce two new algorithms with certificates and present a new framework for theoretical analysis that guarantees the quality of their policies and certificates. For tabular MDPs, we show that computing certificates can even improve the sample-efficiency of optimism-based exploration. As a result, one of our algorithms is the first to achieve minimax-optimal PAC bounds up to lower-order terms, and this algorithm also matches (and in some settings slightly improves upon) existing minimax regret bounds.
APA
Dann, C., Li, L., Wei, W., & Brunskill, E. (2019). Policy Certificates: Towards Accountable Reinforcement Learning. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:1507-1516. Available from https://proceedings.mlr.press/v97/dann19a.html.