Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits

Hao Qin; Chicheng Zhang

Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits

Hao Qin, Chicheng Zhang

Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:5399-5464, 2026.

Abstract

We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that reduces contextual bandit learning with general reward function approximation to offline regression. The framework allows near-optimal regret for contextual bandits with large action spaces with $O(\log T)$ calls to an offline regression oracle over $T$ rounds, and makes $O(\log\log T)$ calls when $T$ is known. The design of OE2D generalizes Falcon and its linear-reward version in that it chooses an action distribution that we term the “exploitative F-design” that simultaneously guarantees low regret and good coverage that trades off exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is bounded in the bounded Eluder dimension per-context and smoothed regret settings. We also establish a relationship between DOEC and the Decision Estimation Coefficient (DEC), bridging the design principles of offline- and online-oracle efficient contextual bandit algorithms for the first time.

Cite this Paper

BibTeX

@InProceedings{pmlr-v336-qin26a,
  title = 	 {Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits},
  author =       {Qin, Hao and Zhang, Chicheng},
  booktitle = 	 {Proceedings of Thirty Ninth Conference on Learning Theory},
  pages = 	 {5399--5464},
  year = 	 {2026},
  editor = 	 {Hanneke, Steve and Lattimore, Tor},
  volume = 	 {336},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29 Jun--03 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v336/main/assets/qin26a/qin26a.pdf},
  url = 	 {https://proceedings.mlr.press/v336/qin26a.html},
  abstract = 	 {We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that reduces contextual bandit learning with general reward function approximation to offline regression. The framework allows near-optimal regret for contextual bandits with large action spaces with $O(\log T)$ calls to an offline regression oracle over $T$ rounds, and makes $O(\log\log T)$ calls when $T$ is known. The design of OE2D generalizes Falcon and its linear-reward version in that it chooses an action distribution that we term the “exploitative F-design” that simultaneously guarantees low regret and good coverage that trades off exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is bounded in the bounded Eluder dimension per-context and smoothed regret settings. We also establish a relationship between DOEC and the Decision Estimation Coefficient (DEC), bridging the design principles of offline- and online-oracle efficient contextual bandit algorithms for the first time.}
}

Endnote

%0 Conference Paper
%T Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits
%A Hao Qin
%A Chicheng Zhang
%B Proceedings of Thirty Ninth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Steve Hanneke
%E Tor Lattimore	
%F pmlr-v336-qin26a
%I PMLR
%P 5399--5464
%U https://proceedings.mlr.press/v336/qin26a.html
%V 336
%X We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that reduces contextual bandit learning with general reward function approximation to offline regression. The framework allows near-optimal regret for contextual bandits with large action spaces with $O(\log T)$ calls to an offline regression oracle over $T$ rounds, and makes $O(\log\log T)$ calls when $T$ is known. The design of OE2D generalizes Falcon and its linear-reward version in that it chooses an action distribution that we term the “exploitative F-design” that simultaneously guarantees low regret and good coverage that trades off exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is bounded in the bounded Eluder dimension per-context and smoothed regret settings. We also establish a relationship between DOEC and the Decision Estimation Coefficient (DEC), bridging the design principles of offline- and online-oracle efficient contextual bandit algorithms for the first time.

APA

Qin, H. & Zhang, C.. (2026). Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:5399-5464 Available from https://proceedings.mlr.press/v336/qin26a.html.

Related Material

Download PDF