Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

Guy Zamir; Matthew Zurek; Yudong Chen

Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

Guy Zamir, Matthew Zurek, Yudong Chen

Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:7016-7061, 2026.

Abstract

Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high “burn-in” costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $\gamma$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\widetilde{O}( \sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes, and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $\gamma$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\|h^\star\|_{\mathrm{sp}}$, our algorithm obtains lower-order terms scaling as $\|h^\star\|_{\mathrm{sp}}S^2 A$, which we prove is optimal in both $\|h^\star\|_{\mathrm{sp}}$ and $A$. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\|h^\star\|_{\mathrm{sp}}^2SA$, and we provide a prior-free algorithm whose lower-order terms scale as $\|h^\star\|_{\mathrm{sp}}^2S^3A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\|h^\star\|_{\mathrm{sp}}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.

Cite this Paper

BibTeX

@InProceedings{pmlr-v336-zamir26a,
  title = 	 {Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs},
  author =       {Zamir, Guy and Zurek, Matthew and Chen, Yudong},
  booktitle = 	 {Proceedings of Thirty Ninth Conference on Learning Theory},
  pages = 	 {7016--7061},
  year = 	 {2026},
  editor = 	 {Hanneke, Steve and Lattimore, Tor},
  volume = 	 {336},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29 Jun--03 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v336/main/assets/zamir26a/zamir26a.pdf},
  url = 	 {https://proceedings.mlr.press/v336/zamir26a.html},
  abstract = 	 {Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high “burn-in” costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $\gamma$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\widetilde{O}( \sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes, and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $\gamma$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\|h^\star\|_{\mathrm{sp}}$, our algorithm obtains lower-order terms scaling as $\|h^\star\|_{\mathrm{sp}}S^2 A$, which we prove is optimal in both $\|h^\star\|_{\mathrm{sp}}$ and $A$.  Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\|h^\star\|_{\mathrm{sp}}^2SA$, and we provide a prior-free algorithm whose lower-order terms scale as $\|h^\star\|_{\mathrm{sp}}^2S^3A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\|h^\star\|_{\mathrm{sp}}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.}
}

Endnote

%0 Conference Paper
%T Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs
%A Guy Zamir
%A Matthew Zurek
%A Yudong Chen
%B Proceedings of Thirty Ninth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Steve Hanneke
%E Tor Lattimore	
%F pmlr-v336-zamir26a
%I PMLR
%P 7016--7061
%U https://proceedings.mlr.press/v336/zamir26a.html
%V 336
%X Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high “burn-in” costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $\gamma$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\widetilde{O}( \sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes, and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $\gamma$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\|h^\star\|_{\mathrm{sp}}$, our algorithm obtains lower-order terms scaling as $\|h^\star\|_{\mathrm{sp}}S^2 A$, which we prove is optimal in both $\|h^\star\|_{\mathrm{sp}}$ and $A$.  Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\|h^\star\|_{\mathrm{sp}}^2SA$, and we provide a prior-free algorithm whose lower-order terms scale as $\|h^\star\|_{\mathrm{sp}}^2S^3A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\|h^\star\|_{\mathrm{sp}}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.

APA

Zamir, G., Zurek, M. & Chen, Y.. (2026). Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:7016-7061 Available from https://proceedings.mlr.press/v336/zamir26a.html.

Related Material

Download PDF