Minimax-optimal reward-agnostic exploration in reinforcement learning

Gen Li, Yuling Yan, Yuxin Chen, Jianqing Fan
Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:3431-3436, 2024.

Abstract

This paper studies reward-agnostic exploration in reinforcement learning (RL) — a scenario where the learner is unaware of the reward functions during the exploration stage — and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon inhomogeneous Markov decision process with S states, A actions, and horizon length H, and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of SAH³/ε² sample episodes (up to logarithmic factors) without guidance of the reward information, our algorithm is able to find ε-optimal policies for all these reward functions, provided that ε is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds S²AH³/ε² episodes (up to logarithmic factors), our algorithm is able to yield ε accuracy for arbitrarily many reward functions (even when they are adversarially designed), a task commonly dubbed as “reward-free exploration.” The novelty of our algorithm design draws on insights from offline RL: the exploration scheme attempts to maximize a critical reward-agnostic quantity that dictates the performance of offline RL, while the policy learning paradigm leverages ideas from sample-optimal offline RL paradigms.
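
For convenience, the two sample-size thresholds quoted above can be restated as displayed bounds. This is only a rewriting of the abstract's own quantities, with N denoting the number of exploration episodes (notation introduced here) and logarithmic factors suppressed:

\[
  N \;\gtrsim\; \frac{S A H^{3}}{\varepsilon^{2}}
  \quad \text{(polynomially many pre-specified reward functions)},
  \qquad
  N \;\gtrsim\; \frac{S^{2} A H^{3}}{\varepsilon^{2}}
  \quad \text{(arbitrarily many reward functions, i.e., reward-free exploration)}.
\]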

Cite this Paper


BibTeX
@InProceedings{pmlr-v247-li24a, title = {Minimax-optimal reward-agnostic exploration in reinforcement learning}, author = {Li, Gen and Yan, Yuling and Chen, Yuxin and Fan, Jianqing}, booktitle = {Proceedings of Thirty Seventh Conference on Learning Theory}, pages = {3431--3436}, year = {2024}, editor = {Agrawal, Shipra and Roth, Aaron}, volume = {247}, series = {Proceedings of Machine Learning Research}, month = {30 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v247/li24a/li24a.pdf}, url = {https://proceedings.mlr.press/v247/li24a.html}, abstract = {This paper studies reward-agnostic exploration in reinforcement learning (RL) — a scenario where the learner is unaware of the reward functions during the exploration stage — and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and horizon length $H$, and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of $\frac{SAH^3}{\varepsilon^2}$ sample episodes (up to log factor) without guidance of the reward information, our algorithm is able to find $\varepsilon$-optimal policies for all these reward functions, provided that $\varepsilon$ is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds $\frac{S^2AH^3}{\varepsilon^2}$ episodes (up to log factor), our algorithm is able to yield $\varepsilon$ accuracy for arbitrarily many reward functions (even when they are adversarially designed), a task commonly dubbed as “reward-free exploration.” The novelty of our algorithm design draws on insights from offline RL: the exploration scheme attempts to maximize a critical reward-agnostic quantity that dictates the performance of offline RL, while the policy learning paradigm leverages ideas from sample-optimal offline RL paradigms. } }
Endnote
%0 Conference Paper %T Minimax-optimal reward-agnostic exploration in reinforcement learning %A Gen Li %A Yuling Yan %A Yuxin Chen %A Jianqing Fan %B Proceedings of Thirty Seventh Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2024 %E Shipra Agrawal %E Aaron Roth %F pmlr-v247-li24a %I PMLR %P 3431--3436 %U https://proceedings.mlr.press/v247/li24a.html %V 247 %X This paper studies reward-agnostic exploration in reinforcement learning (RL) — a scenario where the learner is unaware of the reward functions during the exploration stage — and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and horizon length $H$, and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of $\frac{SAH^3}{\varepsilon^2}$ sample episodes (up to log factor) without guidance of the reward information, our algorithm is able to find $\varepsilon$-optimal policies for all these reward functions, provided that $\varepsilon$ is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds $\frac{S^2AH^3}{\varepsilon^2}$ episodes (up to log factor), our algorithm is able to yield $\varepsilon$ accuracy for arbitrarily many reward functions (even when they are adversarially designed), a task commonly dubbed as “reward-free exploration.” The novelty of our algorithm design draws on insights from offline RL: the exploration scheme attempts to maximize a critical reward-agnostic quantity that dictates the performance of offline RL, while the policy learning paradigm leverages ideas from sample-optimal offline RL paradigms.
APA
Li, G., Yan, Y., Chen, Y. & Fan, J. (2024). Minimax-optimal reward-agnostic exploration in reinforcement learning. Proceedings of Thirty Seventh Conference on Learning Theory, in Proceedings of Machine Learning Research 247:3431-3436. Available from https://proceedings.mlr.press/v247/li24a.html.