Provably Efficient Maximum Entropy Exploration

Elad Hazan, Sham Kakade, Karan Singh, Abby Van Soest
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:2681-2691, 2019.

Abstract

Suppose an agent is in a (possibly unknown) Markov Decision Process in the absence of a reward signal; what might we hope that the agent can efficiently learn to do? This work studies a broad class of objectives that are defined solely as functions of the state-visitation frequencies induced by how the agent behaves. For example, one natural, intrinsically defined objective is for the agent to learn a policy that induces a distribution over the state space that is as uniform as possible, which can be measured in an entropic sense. We provide an efficient algorithm to optimize such intrinsically defined objectives when given access to a black-box planning oracle (which is robust to function approximation). Furthermore, when restricted to the tabular setting, where we have sample-based access to the MDP, our proposed algorithm is provably efficient in terms of both its sample and computational complexities. Key to our algorithmic methodology is the conditional gradient method (a.k.a. the Frank-Wolfe algorithm), which calls an approximate MDP solver.
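The objective in the uniform-coverage example is max_π H(d_π), where d_π is the state-visitation distribution induced by policy π and H(d) = -Σ_s d(s) log d(s). A minimal sketch of the resulting Frank-Wolfe loop on a toy tabular MDP appears below: the gradient of a smoothed entropy at the current policy mixture, r(s) = -log(d(s) + eps), is handed to the planning oracle as a reward, and the returned policy is mixed into the ensemble with the standard step size. The random MDP, the value-iteration planner standing in for the black-box oracle, and the smoothing constant eps are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# A minimal sketch (not the authors' code) of maximum-entropy exploration via
# the conditional gradient / Frank-Wolfe method on a toy tabular MDP.
# Assumed, illustrative pieces: the random MDP below, a finite-horizon
# value-iteration planner standing in for the black-box oracle, and the
# smoothing constant `eps` inside the log.

rng = np.random.default_rng(0)
S, A, H = 10, 2, 50                          # states, actions, horizon (toy sizes)
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, s']
mu0 = np.ones(S) / S                         # initial state distribution

def induced_distribution(pi, H=H):
    """Average state-visitation distribution of policy pi over H steps."""
    d, visits = mu0.copy(), np.zeros(S)
    for _ in range(H):
        visits += d / H
        d = np.einsum('s,sa,sat->t', d, pi, P)  # one-step push-forward
    return visits

def plan(reward, H=H):
    """Planning oracle: finite-horizon value iteration, greedy policy."""
    V = np.zeros(S)
    for _ in range(H):
        Q = reward[:, None] + P @ V          # Q[s, a]
        V = Q.max(axis=1)
    pi = np.zeros((S, A))
    pi[np.arange(S), Q.argmax(axis=1)] = 1.0
    return pi

def max_ent(T=50, eps=1e-3):
    """Frank-Wolfe over policy mixtures, maximizing the entropy of the
    induced state distribution; returns the policies and mixture weights."""
    policies, weights = [np.full((S, A), 1.0 / A)], [1.0]
    for t in range(T):
        d = sum(w * induced_distribution(p) for p, w in zip(policies, weights))
        r = -np.log(d + eps)                 # ~ gradient of the smoothed entropy
        eta = 2.0 / (t + 2)                  # standard Frank-Wolfe step size
        weights = [w * (1 - eta) for w in weights] + [eta]
        policies.append(plan(r))
    return policies, weights

policies, weights = max_ent()
d = sum(w * induced_distribution(p) for p, w in zip(policies, weights))
print("entropy of induced state distribution:", -(d * np.log(d + 1e-12)).sum())
```

In the tabular, sample-based setting the paper analyzes, the exact visitation distribution and the exact planner above would be replaced by estimates built from samples; the outer conditional-gradient loop is unchanged.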

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-hazan19a,
  title     = {Provably Efficient Maximum Entropy Exploration},
  author    = {Hazan, Elad and Kakade, Sham and Singh, Karan and Van Soest, Abby},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {2681--2691},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/hazan19a/hazan19a.pdf},
  url       = {https://proceedings.mlr.press/v97/hazan19a.html},
  abstract  = {Suppose an agent is in a (possibly unknown) Markov Decision Process in the absence of a reward signal; what might we hope that the agent can efficiently learn to do? This work studies a broad class of objectives that are defined solely as functions of the state-visitation frequencies induced by how the agent behaves. For example, one natural, intrinsically defined objective is for the agent to learn a policy that induces a distribution over the state space that is as uniform as possible, which can be measured in an entropic sense. We provide an efficient algorithm to optimize such intrinsically defined objectives when given access to a black-box planning oracle (which is robust to function approximation). Furthermore, when restricted to the tabular setting, where we have sample-based access to the MDP, our proposed algorithm is provably efficient in terms of both its sample and computational complexities. Key to our algorithmic methodology is the conditional gradient method (a.k.a. the Frank-Wolfe algorithm), which calls an approximate MDP solver.}
}
Endnote
%0 Conference Paper
%T Provably Efficient Maximum Entropy Exploration
%A Elad Hazan
%A Sham Kakade
%A Karan Singh
%A Abby Van Soest
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-hazan19a
%I PMLR
%P 2681--2691
%U https://proceedings.mlr.press/v97/hazan19a.html
%V 97
%X Suppose an agent is in a (possibly unknown) Markov Decision Process in the absence of a reward signal; what might we hope that the agent can efficiently learn to do? This work studies a broad class of objectives that are defined solely as functions of the state-visitation frequencies induced by how the agent behaves. For example, one natural, intrinsically defined objective is for the agent to learn a policy that induces a distribution over the state space that is as uniform as possible, which can be measured in an entropic sense. We provide an efficient algorithm to optimize such intrinsically defined objectives when given access to a black-box planning oracle (which is robust to function approximation). Furthermore, when restricted to the tabular setting, where we have sample-based access to the MDP, our proposed algorithm is provably efficient in terms of both its sample and computational complexities. Key to our algorithmic methodology is the conditional gradient method (a.k.a. the Frank-Wolfe algorithm), which calls an approximate MDP solver.
APA
Hazan, E., Kakade, S., Singh, K. & Van Soest, A. (2019). Provably Efficient Maximum Entropy Exploration. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:2681-2691. Available from https://proceedings.mlr.press/v97/hazan19a.html.
