Minimax-optimal reward-agnostic exploration in reinforcement learning
Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:3431-3436, 2024.
Abstract
This paper studies reward-agnostic exploration in reinforcement learning (RL), a scenario in which the learner is unaware of the reward functions during the exploration stage, and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and horizon length $H$, and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of $\frac{SAH^3}{\varepsilon^2}$ sample episodes (up to logarithmic factors) without guidance from reward information, our algorithm can find $\varepsilon$-optimal policies for all of these reward functions, provided that $\varepsilon$ is sufficiently small. This yields the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds $\frac{S^2AH^3}{\varepsilon^2}$ episodes (up to logarithmic factors), our algorithm can achieve $\varepsilon$ accuracy for arbitrarily many reward functions (even when they are adversarially designed), a task commonly dubbed "reward-free exploration." The novelty of our algorithm design draws on insights from offline RL: the exploration scheme attempts to maximize a critical reward-agnostic quantity that dictates the performance of offline RL, while the policy learning stage leverages ideas from sample-optimal offline RL algorithms.
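For quick reference, the two sample-complexity regimes stated above can be summarized in one display (a restatement of the abstract's bounds; here $N$ denotes the number of exploration episodes and $\widetilde{O}(\cdot)$ hides logarithmic factors, notation introduced only for this summary):

$$
N = \underbrace{\widetilde{O}\!\left(\frac{SAH^{3}}{\varepsilon^{2}}\right)}_{\substack{\text{reward-agnostic:}\\ \text{polynomially many given rewards}}}
\qquad \text{vs.} \qquad
N = \underbrace{\widetilde{O}\!\left(\frac{S^{2}AH^{3}}{\varepsilon^{2}}\right)}_{\substack{\text{reward-free:}\\ \text{arbitrarily many (even adversarial) rewards}}}
$$

Note that the reward-agnostic regime removes a factor of $S$ from the reward-free requirement, at the price of committing in advance to a polynomial-size family of reward functions.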