[edit]
Stochastic Constrained Contextual Bandits via Lyapunov Optimization Based Estimation to Decision Framework
Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:2204-2231, 2024.
Abstract
This paper studies the problem of stochastic constrained contextual bandits (CCB) under general realizability condition where the expected rewards and costs are within general function classes. We propose LOE2D, a Lyapunov Optimization Based Estimation to Decision framework with online regression oracles for learning reward/constraint. LOE2D establishes $\Tilde O(T^{\frac{3}{4}}U^{\frac{1}{4}})$ regret and constraint violation, which can be further refined to $\Tilde O(\min\{\sqrt{TU}/\varepsilon^2, T^{\frac{3}{4}}U^{\frac{1}{4}}\})$ when the Slater condition holds in the underlying offline problem with the Slater “constant” $ \varepsilon=\Omega(\sqrt{U/T}),$ where $U$ denotes the error bounds of online regression oracles. These results improve LagrangeCBwLC in two aspects: i) our results hold without any prior information while LagrangeCBwLC requires the knowledge of Slater constant to design a proper learning rate; ii) our results hold when $\varepsilon=\Omega(\sqrt{U/T})$ while LagrangeCBwLC requires a constant margin $\varepsilon=\Omega(1).$ These improvements stem from two novel techniques: violation-adaptive learning in E2D module and multi-step Lyapunov drift analysis in bounding constraint violation. The experiments further justify LOE2D outperforms the baseline algorithm.