Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits

Hao Qin, Chicheng Zhang
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:5399-5464, 2026.

Abstract

We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that reduces contextual bandit learning with general reward function approximation to offline regression. The framework allows near-optimal regret for contextual bandits with large action spaces with $O(\log T)$ calls to an offline regression oracle over $T$ rounds, and makes $O(\log\log T)$ calls when $T$ is known. The design of OE2D generalizes Falcon and its linear-reward version in that it chooses an action distribution that we term the “exploitative F-design” that simultaneously guarantees low regret and good coverage that trades off exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is bounded in the bounded Eluder dimension per-context and smoothed regret settings. We also establish a relationship between DOEC and the Decision Estimation Coefficient (DEC), bridging the design principles of offline- and online-oracle efficient contextual bandit algorithms for the first time.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-qin26a, title = {Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits}, author = {Qin, Hao and Zhang, Chicheng}, booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory}, pages = {5399--5464}, year = {2026}, editor = {Hanneke, Steve and Lattimore, Tor}, volume = {336}, series = {Proceedings of Machine Learning Research}, month = {29 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/qin26a/qin26a.pdf}, url = {https://proceedings.mlr.press/v336/qin26a.html}, abstract = {We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that reduces contextual bandit learning with general reward function approximation to offline regression. The framework allows near-optimal regret for contextual bandits with large action spaces with $O(\log T)$ calls to an offline regression oracle over $T$ rounds, and makes $O(\log\log T)$ calls when $T$ is known. The design of OE2D generalizes Falcon and its linear-reward version in that it chooses an action distribution that we term the “exploitative F-design” that simultaneously guarantees low regret and good coverage that trades off exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is bounded in the bounded Eluder dimension per-context and smoothed regret settings. We also establish a relationship between DOEC and the Decision Estimation Coefficient (DEC), bridging the design principles of offline- and online-oracle efficient contextual bandit algorithms for the first time.} }
Endnote
%0 Conference Paper %T Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits %A Hao Qin %A Chicheng Zhang %B Proceedings of Thirty Ninth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2026 %E Steve Hanneke %E Tor Lattimore %F pmlr-v336-qin26a %I PMLR %P 5399--5464 %U https://proceedings.mlr.press/v336/qin26a.html %V 336 %X We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that reduces contextual bandit learning with general reward function approximation to offline regression. The framework allows near-optimal regret for contextual bandits with large action spaces with $O(\log T)$ calls to an offline regression oracle over $T$ rounds, and makes $O(\log\log T)$ calls when $T$ is known. The design of OE2D generalizes Falcon and its linear-reward version in that it chooses an action distribution that we term the “exploitative F-design” that simultaneously guarantees low regret and good coverage that trades off exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is bounded in the bounded Eluder dimension per-context and smoothed regret settings. We also establish a relationship between DOEC and the Decision Estimation Coefficient (DEC), bridging the design principles of offline- and online-oracle efficient contextual bandit algorithms for the first time.
APA
Qin, H. & Zhang, C.. (2026). Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:5399-5464 Available from https://proceedings.mlr.press/v336/qin26a.html.

Related Material