Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in MDPs

Andrea Zanette, Emma Brunskill
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:5747-5755, 2018.

Abstract

In order to make good decision under uncertainty an agent must learn from observations. To do so, two of the most common frameworks are Contextual Bandits and Markov Decision Processes (MDPs). In this paper, we study whether there exist algorithms for the more general framework (MDP) which automatically provide the best performance bounds for the specific problem at hand without user intervention and without modifying the algorithm. In particular, it is found that a very minor variant of a recently proposed reinforcement learning algorithm for MDPs already matches the best possible regret bound $\tilde O (\sqrt{SAT})$ in the dominant term if deployed on a tabular Contextual Bandit problem despite the agent being agnostic to such setting.

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-zanette18a, title = {Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in {MDP}s}, author = {Zanette, Andrea and Brunskill, Emma}, booktitle = {Proceedings of the 35th International Conference on Machine Learning}, pages = {5747--5755}, year = {2018}, editor = {Dy, Jennifer and Krause, Andreas}, volume = {80}, series = {Proceedings of Machine Learning Research}, month = {10--15 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v80/zanette18a/zanette18a.pdf}, url = {http://proceedings.mlr.press/v80/zanette18a.html}, abstract = {In order to make good decision under uncertainty an agent must learn from observations. To do so, two of the most common frameworks are Contextual Bandits and Markov Decision Processes (MDPs). In this paper, we study whether there exist algorithms for the more general framework (MDP) which automatically provide the best performance bounds for the specific problem at hand without user intervention and without modifying the algorithm. In particular, it is found that a very minor variant of a recently proposed reinforcement learning algorithm for MDPs already matches the best possible regret bound $\tilde O (\sqrt{SAT})$ in the dominant term if deployed on a tabular Contextual Bandit problem despite the agent being agnostic to such setting.} }
Endnote
%0 Conference Paper %T Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in MDPs %A Andrea Zanette %A Emma Brunskill %B Proceedings of the 35th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2018 %E Jennifer Dy %E Andreas Krause %F pmlr-v80-zanette18a %I PMLR %P 5747--5755 %U http://proceedings.mlr.press/v80/zanette18a.html %V 80 %X In order to make good decision under uncertainty an agent must learn from observations. To do so, two of the most common frameworks are Contextual Bandits and Markov Decision Processes (MDPs). In this paper, we study whether there exist algorithms for the more general framework (MDP) which automatically provide the best performance bounds for the specific problem at hand without user intervention and without modifying the algorithm. In particular, it is found that a very minor variant of a recently proposed reinforcement learning algorithm for MDPs already matches the best possible regret bound $\tilde O (\sqrt{SAT})$ in the dominant term if deployed on a tabular Contextual Bandit problem despite the agent being agnostic to such setting.
APA
Zanette, A. & Brunskill, E.. (2018). Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in MDPs. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:5747-5755 Available from http://proceedings.mlr.press/v80/zanette18a.html.

Related Material