Safe Linear Bandits over Unknown Polytopes

Aditya Gangrade, Tianrui Chen, Venkatesh Saligrama
Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:1755-1795, 2024.

Abstract

The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown \emph{roundwise} constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes, and the role of aggressive {doubly-optimistic play} in avoiding the strong assumptions made by extant pessimistic-optimistic approaches. We first elucidate an inherent hardness in SLBs due the lack of knowledge of constraints: there exist ‘easy’ instances, for which suboptimal extreme points have large ‘gaps’, but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret or safety violations, due to an inability to resolve unknown optima to arbitrary precision. We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, \textsc{doss}, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, \textsc{doss} simultaneously obtains tight instance-dependent $O(\log^2 T)$ bounds on efficacy regret, and $\widetilde O(\sqrt{T})$ bounds on safety violations, thus attaining near Pareto-optimality. Further, when safety is demanded to a finite precision, violations improve to $O(\log^2 T).$ These results rely on a novel dual analysis of linear bandits: we argue that \textsc{doss} proceeds by activating noisy versions of at least $d$ constraints in each round, which allows us to separately analyse rounds where a ‘poor’ set of constraints is activated, and rounds where ‘good’ sets of constraints are activated. The costs in the former are controlled to $O(\log^2 T)$ by developing new dual notions of gaps, based on global sensitivity analyses of linear programs, that quantify the suboptimality of each such set of constraints. The latter costs are controlled to $O(1)$ by explicitly analysing the solutions of optimistic play.

Cite this Paper


BibTeX
@InProceedings{pmlr-v247-gangrade24a, title = {Safe Linear Bandits over Unknown Polytopes}, author = {Gangrade, Aditya and Chen, Tianrui and Saligrama, Venkatesh}, booktitle = {Proceedings of Thirty Seventh Conference on Learning Theory}, pages = {1755--1795}, year = {2024}, editor = {Agrawal, Shipra and Roth, Aaron}, volume = {247}, series = {Proceedings of Machine Learning Research}, month = {30 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v247/gangrade24a/gangrade24a.pdf}, url = {https://proceedings.mlr.press/v247/gangrade24a.html}, abstract = {The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown \emph{roundwise} constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes, and the role of aggressive {doubly-optimistic play} in avoiding the strong assumptions made by extant pessimistic-optimistic approaches. We first elucidate an inherent hardness in SLBs due the lack of knowledge of constraints: there exist ‘easy’ instances, for which suboptimal extreme points have large ‘gaps’, but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret or safety violations, due to an inability to resolve unknown optima to arbitrary precision. We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, \textsc{doss}, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, \textsc{doss} simultaneously obtains tight instance-dependent $O(\log^2 T)$ bounds on efficacy regret, and $\widetilde O(\sqrt{T})$ bounds on safety violations, thus attaining near Pareto-optimality. Further, when safety is demanded to a finite precision, violations improve to $O(\log^2 T).$ These results rely on a novel dual analysis of linear bandits: we argue that \textsc{doss} proceeds by activating noisy versions of at least $d$ constraints in each round, which allows us to separately analyse rounds where a ‘poor’ set of constraints is activated, and rounds where ‘good’ sets of constraints are activated. The costs in the former are controlled to $O(\log^2 T)$ by developing new dual notions of gaps, based on global sensitivity analyses of linear programs, that quantify the suboptimality of each such set of constraints. The latter costs are controlled to $O(1)$ by explicitly analysing the solutions of optimistic play. } }
Endnote
%0 Conference Paper %T Safe Linear Bandits over Unknown Polytopes %A Aditya Gangrade %A Tianrui Chen %A Venkatesh Saligrama %B Proceedings of Thirty Seventh Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2024 %E Shipra Agrawal %E Aaron Roth %F pmlr-v247-gangrade24a %I PMLR %P 1755--1795 %U https://proceedings.mlr.press/v247/gangrade24a.html %V 247 %X The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown \emph{roundwise} constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes, and the role of aggressive {doubly-optimistic play} in avoiding the strong assumptions made by extant pessimistic-optimistic approaches. We first elucidate an inherent hardness in SLBs due the lack of knowledge of constraints: there exist ‘easy’ instances, for which suboptimal extreme points have large ‘gaps’, but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret or safety violations, due to an inability to resolve unknown optima to arbitrary precision. We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, \textsc{doss}, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, \textsc{doss} simultaneously obtains tight instance-dependent $O(\log^2 T)$ bounds on efficacy regret, and $\widetilde O(\sqrt{T})$ bounds on safety violations, thus attaining near Pareto-optimality. Further, when safety is demanded to a finite precision, violations improve to $O(\log^2 T).$ These results rely on a novel dual analysis of linear bandits: we argue that \textsc{doss} proceeds by activating noisy versions of at least $d$ constraints in each round, which allows us to separately analyse rounds where a ‘poor’ set of constraints is activated, and rounds where ‘good’ sets of constraints are activated. The costs in the former are controlled to $O(\log^2 T)$ by developing new dual notions of gaps, based on global sensitivity analyses of linear programs, that quantify the suboptimality of each such set of constraints. The latter costs are controlled to $O(1)$ by explicitly analysing the solutions of optimistic play.
APA
Gangrade, A., Chen, T. & Saligrama, V.. (2024). Safe Linear Bandits over Unknown Polytopes. Proceedings of Thirty Seventh Conference on Learning Theory, in Proceedings of Machine Learning Research 247:1755-1795 Available from https://proceedings.mlr.press/v247/gangrade24a.html.

Related Material