Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning

Harin Lee; Min-hwan Oh

Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning

Harin Lee, Min-hwan Oh

Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:4520-4584, 2026.

Abstract

We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a \emph{distributional regret bound} as a probabilistic guarantee that holds \emph{uniformly} over all confidence levels $\delta \in (0,1]$, thereby characterizing the regret distribution across the full range of $\delta$. We present a simple UCBVI-style algorithm with exploration bonus $\min{c_{1,k}/N, c_{2,k}/\sqrt{N}}$, where $N$ denotes the visit count and $(c_{1,k},c_{2,k})$ are user-specified parameters. For arbitrary parameter sequences, we derive general gap-independent and gap-dependent distributional regret bounds, yielding a principled characterization of how the parameters control the trade-off between expected performance, tail risk, and instance-dependent behavior. In particular, our bounds achieve optimal trade-offs between expected and distributional regret in both minimax and instance-dependent regimes. As a special case, for multi-armed bandits with $A$ arms and horizon $T$, we obtain a distributional regret bound of order $\mathcal{O}\big(\sqrt{AT}\log(1/\delta)\big)$, confirming the conjecture of Lattimore and Szepesvári (2020, Section 17.1) for the first time.

Cite this Paper

BibTeX

@InProceedings{pmlr-v336-lee26b,
  title = 	 {Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning},
  author =       {Lee, Harin and Oh, Min{-}hwan},
  booktitle = 	 {Proceedings of Thirty Ninth Conference on Learning Theory},
  pages = 	 {4520--4584},
  year = 	 {2026},
  editor = 	 {Hanneke, Steve and Lattimore, Tor},
  volume = 	 {336},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29 Jun--03 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v336/main/assets/lee26b/lee26b.pdf},
  url = 	 {https://proceedings.mlr.press/v336/lee26b.html},
  abstract = 	 {We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a \emph{distributional regret bound} as a probabilistic guarantee that holds \emph{uniformly} over all confidence levels $\delta \in (0,1]$, thereby characterizing the regret distribution across the full range of $\delta$. We present a simple UCBVI-style algorithm with exploration bonus $\min{c_{1,k}/N, c_{2,k}/\sqrt{N}}$, where $N$ denotes the visit count and $(c_{1,k},c_{2,k})$ are user-specified parameters. For arbitrary parameter sequences, we derive general gap-independent and gap-dependent distributional regret bounds, yielding a principled characterization of how the parameters control the trade-off between expected performance, tail risk, and instance-dependent behavior. In particular, our bounds achieve optimal trade-offs between expected and distributional regret in both minimax and instance-dependent regimes. As a special case, for multi-armed bandits with $A$ arms and horizon $T$, we obtain a distributional regret bound of order $\mathcal{O}\big(\sqrt{AT}\log(1/\delta)\big)$, confirming the conjecture of Lattimore and Szepesvári (2020, Section 17.1) for the first time.}
}

Endnote

%0 Conference Paper
%T Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning
%A Harin Lee
%A Min-hwan Oh
%B Proceedings of Thirty Ninth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Steve Hanneke
%E Tor Lattimore	
%F pmlr-v336-lee26b
%I PMLR
%P 4520--4584
%U https://proceedings.mlr.press/v336/lee26b.html
%V 336
%X We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a \emph{distributional regret bound} as a probabilistic guarantee that holds \emph{uniformly} over all confidence levels $\delta \in (0,1]$, thereby characterizing the regret distribution across the full range of $\delta$. We present a simple UCBVI-style algorithm with exploration bonus $\min{c_{1,k}/N, c_{2,k}/\sqrt{N}}$, where $N$ denotes the visit count and $(c_{1,k},c_{2,k})$ are user-specified parameters. For arbitrary parameter sequences, we derive general gap-independent and gap-dependent distributional regret bounds, yielding a principled characterization of how the parameters control the trade-off between expected performance, tail risk, and instance-dependent behavior. In particular, our bounds achieve optimal trade-offs between expected and distributional regret in both minimax and instance-dependent regimes. As a special case, for multi-armed bandits with $A$ arms and horizon $T$, we obtain a distributional regret bound of order $\mathcal{O}\big(\sqrt{AT}\log(1/\delta)\big)$, confirming the conjecture of Lattimore and Szepesvári (2020, Section 17.1) for the first time.

APA

Lee, H. & Oh, M.. (2026). Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:4520-4584 Available from https://proceedings.mlr.press/v336/lee26b.html.

Related Material

Download PDF