Tight Sample Complexity Bounds for Entropic Best Policy Identification

Amer Essakine, Claire Vernade
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:2339-2398, 2026.

Abstract

We study best-policy identification for finite-horizon risk-sensitive reinforcement learning under the entropic risk measure. Recent work established a constant gap in the exponential horizon dependence between lower and upper bounds on the number of samples required to identify an approximately optimal policy. Precisely, known lower bounds scale in $\Omega(\exp(|\beta|H))$ where $H$ is the horizon of the MDP, while the state-of-the-art upper bound achieves at best $O(\exp(2|\beta|H))$ (Mortensen and Talebi, 2025) using a generative model. We show that this extra exponential factor can be traced to overly loose concentration control for exponential utilities. To close this open gap, we revisit the analysis of this problem through a forward-model-based algorithm building on KL-based exploration bonuses that we adapt to the entropic criterion. The improvement we get is due to two main novel technical innovations. We leverage the smoothness properties of the exponential utility to derive sharper concentration bounds, and we propose a new stopping rule that exploits further this tightness to obtain a sample complexity that matches the lower bound.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-essakine26a, title = {Tight Sample Complexity Bounds for Entropic Best Policy Identification}, author = {Essakine, Amer and Vernade, Claire}, booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory}, pages = {2339--2398}, year = {2026}, editor = {Hanneke, Steve and Lattimore, Tor}, volume = {336}, series = {Proceedings of Machine Learning Research}, month = {29 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/essakine26a/essakine26a.pdf}, url = {https://proceedings.mlr.press/v336/essakine26a.html}, abstract = {We study best-policy identification for finite-horizon risk-sensitive reinforcement learning under the entropic risk measure. Recent work established a constant gap in the exponential horizon dependence between lower and upper bounds on the number of samples required to identify an approximately optimal policy. Precisely, known lower bounds scale in $\Omega(\exp(|\beta|H))$ where $H$ is the horizon of the MDP, while the state-of-the-art upper bound achieves at best $O(\exp(2|\beta|H))$ (Mortensen and Talebi, 2025) using a generative model. We show that this extra exponential factor can be traced to overly loose concentration control for exponential utilities. To close this open gap, we revisit the analysis of this problem through a forward-model-based algorithm building on KL-based exploration bonuses that we adapt to the entropic criterion. The improvement we get is due to two main novel technical innovations. We leverage the smoothness properties of the exponential utility to derive sharper concentration bounds, and we propose a new stopping rule that exploits further this tightness to obtain a sample complexity that matches the lower bound.} }
Endnote
%0 Conference Paper %T Tight Sample Complexity Bounds for Entropic Best Policy Identification %A Amer Essakine %A Claire Vernade %B Proceedings of Thirty Ninth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2026 %E Steve Hanneke %E Tor Lattimore %F pmlr-v336-essakine26a %I PMLR %P 2339--2398 %U https://proceedings.mlr.press/v336/essakine26a.html %V 336 %X We study best-policy identification for finite-horizon risk-sensitive reinforcement learning under the entropic risk measure. Recent work established a constant gap in the exponential horizon dependence between lower and upper bounds on the number of samples required to identify an approximately optimal policy. Precisely, known lower bounds scale in $\Omega(\exp(|\beta|H))$ where $H$ is the horizon of the MDP, while the state-of-the-art upper bound achieves at best $O(\exp(2|\beta|H))$ (Mortensen and Talebi, 2025) using a generative model. We show that this extra exponential factor can be traced to overly loose concentration control for exponential utilities. To close this open gap, we revisit the analysis of this problem through a forward-model-based algorithm building on KL-based exploration bonuses that we adapt to the entropic criterion. The improvement we get is due to two main novel technical innovations. We leverage the smoothness properties of the exponential utility to derive sharper concentration bounds, and we propose a new stopping rule that exploits further this tightness to obtain a sample complexity that matches the lower bound.
APA
Essakine, A. & Vernade, C.. (2026). Tight Sample Complexity Bounds for Entropic Best Policy Identification. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:2339-2398 Available from https://proceedings.mlr.press/v336/essakine26a.html.

Related Material