[edit]
Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:5399-5464, 2026.
Abstract
We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that reduces contextual bandit learning with general reward function approximation to offline regression. The framework allows near-optimal regret for contextual bandits with large action spaces with $O(\log T)$ calls to an offline regression oracle over $T$ rounds, and makes $O(\log\log T)$ calls when $T$ is known. The design of OE2D generalizes Falcon and its linear-reward version in that it chooses an action distribution that we term the “exploitative F-design” that simultaneously guarantees low regret and good coverage that trades off exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is bounded in the bounded Eluder dimension per-context and smoothed regret settings. We also establish a relationship between DOEC and the Decision Estimation Coefficient (DEC), bridging the design principles of offline- and online-oracle efficient contextual bandit algorithms for the first time.