Span-Agnostic Optimal Sample Complexity and Oracle Inequalities for Average-Reward RL
Proceedings of Thirty Eighth Conference on Learning Theory, PMLR 291:6156-6209, 2025.
Abstract
We study the sample complexity of finding an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs) with a generative model. The minimax optimal span-based complexity of $\widetilde{O}(SAH/\varepsilon^2)$, where $H$ is the span of the optimal bias function, has only been achievable with prior knowledge of the value of $H$. Prior-knowledge-free algorithms have been the objective of intensive research, but several natural approaches provably fail to achieve this goal. We resolve this problem, developing the first algorithms matching the optimal span-based complexity without knowledge of $H$, both when the dataset size is fixed and when the suboptimality level $\varepsilon$ is fixed. Our main technique combines the discounted reduction approach with a method for automatically tuning the effective horizon based on empirical confidence intervals or lower bounds on performance, which we term \textit{horizon calibration}. We also develop an \textit{empirical span penalization} approach, inspired by sample variance penalization, which satisfies an \textit{oracle inequality} performance guarantee. In particular, this algorithm can outperform the minimax complexity in benign settings, such as when there exist near-optimal policies with span much smaller than $H$.
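To make the high-level idea concrete, the following is a minimal illustrative sketch (not the paper's algorithm) of the generic "guess-and-double" pattern behind horizon calibration: a discounted reduction is solved for geometrically increasing guesses of the effective horizon, and the guess is accepted once an empirical lower-bound certificate on the learned policy's average reward comes within $\varepsilon$ of an empirical upper bound. The helpers solve_discounted, certify_lower_bound, and upper_bound_estimate are hypothetical placeholders supplied by the caller; the specific confidence intervals, constants, and guarantees of the paper are not reproduced here.

```python
def calibrate_horizon(samples, eps, solve_discounted, certify_lower_bound,
                      upper_bound_estimate, h_max=2**20):
    """Illustrative guess-and-double horizon calibration (assumed interface).

    samples              -- dataset drawn from the generative model
    eps                  -- target suboptimality level
    solve_discounted     -- hypothetical: plans in the empirical gamma-discounted MDP
    certify_lower_bound  -- hypothetical: data-driven lower bound on a policy's gain
    upper_bound_estimate -- hypothetical: optimistic estimate of the optimal gain
    """
    h_hat = 1.0
    policy = None
    while h_hat <= h_max:
        gamma = 1.0 - eps / h_hat                 # effective horizon ~ h_hat / eps
        policy = solve_discounted(samples, gamma)
        lb = certify_lower_bound(samples, policy)  # certified lower bound on gain
        ub = upper_bound_estimate(samples, h_hat)  # optimistic upper bound on optimal gain
        if ub - lb <= eps:                         # certificate: policy is eps-optimal
            return policy, h_hat
        h_hat *= 2.0                               # otherwise double the horizon guess
    return policy, h_hat
```

The point of the sketch is only the calibration loop itself: the span $H$ never appears as an input, and the stopping rule is driven entirely by empirical quantities, which is the property the abstract highlights.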