Stochastic Constrained Contextual Bandits via Lyapunov Optimization Based Estimation to Decision Framework
Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:2204-2231, 2024.
Abstract
This paper studies the problem of stochastic constrained contextual bandits (CCB) under a general realizability condition, where the expected rewards and costs lie in general function classes. We propose LOE2D, a Lyapunov Optimization Based Estimation to Decision framework with online regression oracles for learning the rewards and constraints. LOE2D establishes $\tilde{O}(T^{3/4}U^{1/4})$ regret and constraint violation, which can be further refined to $\tilde{O}(\min\{\dots\})$ when the Slater condition holds in the underlying offline problem with Slater “constant” $\varepsilon=\Omega(\sqrt{U/T})$, where $U$ denotes the error bound of the online regression oracles. These results improve upon LagrangeCBwLC in two aspects: i) our results hold without any prior information, while LagrangeCBwLC requires knowledge of the Slater constant to design a proper learning rate; ii) our results hold when $\varepsilon=\Omega(\sqrt{U/T})$, while LagrangeCBwLC requires a constant margin $\varepsilon=\Omega(1)$. These improvements stem from two novel techniques: violation-adaptive learning in the E2D module and a multi-step Lyapunov drift analysis for bounding the constraint violation. Experiments further demonstrate that LOE2D outperforms the baseline algorithm.
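The abstract does not spell out the algorithm itself; the following is a minimal toy sketch of the general Lyapunov drift-plus-penalty idea that underlies such estimation-to-decision frameworks, not the paper's LOE2D procedure. In the sketch, a virtual queue tracks cumulative constraint violation, each round the learner picks the arm maximizing a penalty-weighted estimated reward minus the queue-weighted estimated cost, and running-mean estimators stand in for the paper's online regression oracles. All names and numbers (the penalty weight `V`, the per-round `budget`, the arm statistics) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: 3 arms with Bernoulli rewards/costs (hypothetical numbers).
true_reward = np.array([0.8, 0.5, 0.3])
true_cost = np.array([0.7, 0.4, 0.1])
budget = 0.5          # per-round cost budget
V = 50.0              # penalty weight trading off reward against the virtual queue
T = 5000

# Running-mean estimators as a stand-in for the online regression oracles.
counts = np.ones(3)
reward_hat = np.full(3, 0.5)
cost_hat = np.full(3, 0.5)

Q = 0.0               # virtual queue tracking cumulative budget violation
total_reward = 0.0

for t in range(T):
    # Drift-plus-penalty rule: maximize V * estimated reward - Q * estimated cost.
    scores = V * reward_hat - Q * cost_hat
    a = int(np.argmax(scores))

    # Observe stochastic reward and cost for the chosen arm.
    r = rng.binomial(1, true_reward[a])
    c = rng.binomial(1, true_cost[a])
    total_reward += r

    # Oracle update (simple running means here; the paper uses general regression oracles).
    counts[a] += 1
    reward_hat[a] += (r - reward_hat[a]) / counts[a]
    cost_hat[a] += (c - cost_hat[a]) / counts[a]

    # Virtual queue update: accumulate violation above the per-round budget.
    Q = max(Q + c - budget, 0.0)

print(f"average reward = {total_reward / T:.3f}, final queue = {Q:.2f}")
```

With these toy parameters the queue pushes the learner away from the high-cost arm once violations accumulate, illustrating how the decision rule consumes whatever estimates the regression oracles provide.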