Safe Linear Bandits over Unknown Polytopes

Aditya Gangrade; Tianrui Chen; Venkatesh Saligrama

Safe Linear Bandits over Unknown Polytopes

Aditya Gangrade, Tianrui Chen, Venkatesh Saligrama

Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:1755-1795, 2024.

Abstract

The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown \emph{roundwise} constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes, and the role of aggressive {doubly-optimistic play} in avoiding the strong assumptions made by extant pessimistic-optimistic approaches. We first elucidate an inherent hardness in SLBs due the lack of knowledge of constraints: there exist ‘easy’ instances, for which suboptimal extreme points have large ‘gaps’, but on which SLB methods must still incur

$\Omega(\sqrt{T})$ regret or safety violations, due to an inability to resolve unknown optima to arbitrary precision. We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, \textsc{doss}, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, \textsc{doss} simultaneously obtains tight instance-dependent

$O(\log^2 T)$ bounds on efficacy regret, and

$\widetilde O(\sqrt{T})$ bounds on safety violations, thus attaining near Pareto-optimality. Further, when safety is demanded to a finite precision, violations improve to

$O(\log^2 T).$ These results rely on a novel dual analysis of linear bandits: we argue that \textsc{doss} proceeds by activating noisy versions of at least

$d$ constraints in each round, which allows us to separately analyse rounds where a ‘poor’ set of constraints is activated, and rounds where ‘good’ sets of constraints are activated. The costs in the former are controlled to

$O(\log^2 T)$ by developing new dual notions of gaps, based on global sensitivity analyses of linear programs, that quantify the suboptimality of each such set of constraints. The latter costs are controlled to

$O(1)$ by explicitly analysing the solutions of optimistic play.

Cite this Paper

BibTeX


@InProceedings{pmlr-v247-gangrade24a,
  title = 	 {Safe Linear Bandits over Unknown Polytopes},
  author =       {Gangrade, Aditya and Chen, Tianrui and Saligrama, Venkatesh},
  booktitle = 	 {Proceedings of Thirty Seventh Conference on Learning Theory},
  pages = 	 {1755--1795},
  year = 	 {2024},
  editor = 	 {Agrawal, Shipra and Roth, Aaron},
  volume = 	 {247},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {30 Jun--03 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v247/gangrade24a/gangrade24a.pdf},
  url = 	 {https://proceedings.mlr.press/v247/gangrade24a.html},
  abstract = 	 {The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown \emph{roundwise} constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes, and the role of aggressive {doubly-optimistic play} in avoiding the strong assumptions made by extant pessimistic-optimistic approaches.  We first elucidate an inherent hardness in SLBs due the lack of knowledge of constraints: there exist ‘easy’ instances, for which suboptimal extreme points have large ‘gaps’, but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret or safety violations, due to an inability to resolve unknown optima to arbitrary precision. We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, \textsc{doss}, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, \textsc{doss} simultaneously obtains tight instance-dependent $O(\log^2 T)$ bounds on efficacy regret, and $\widetilde O(\sqrt{T})$ bounds on safety violations, thus attaining near Pareto-optimality. Further, when safety is demanded to a finite precision, violations improve to $O(\log^2 T).$ These results rely on a novel dual analysis of linear bandits: we argue that \textsc{doss} proceeds by activating noisy versions of at least $d$ constraints in each round, which allows us to separately analyse rounds where a ‘poor’ set of constraints is activated, and rounds where ‘good’ sets of constraints are activated. The costs in the former are controlled to $O(\log^2 T)$ by developing new dual notions of gaps, based on global sensitivity analyses of linear programs, that quantify the suboptimality of each such set of constraints. The latter costs are controlled to $O(1)$ by explicitly analysing the solutions of optimistic play. }
}

Endnote

%0 Conference Paper
%T Safe Linear Bandits over Unknown Polytopes
%A Aditya Gangrade
%A Tianrui Chen
%A Venkatesh Saligrama
%B Proceedings of Thirty Seventh Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2024
%E Shipra Agrawal
%E Aaron Roth	
%F pmlr-v247-gangrade24a
%I PMLR
%P 1755--1795
%U https://proceedings.mlr.press/v247/gangrade24a.html
%V 247
%X The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown \emph{roundwise} constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes, and the role of aggressive {doubly-optimistic play} in avoiding the strong assumptions made by extant pessimistic-optimistic approaches.  We first elucidate an inherent hardness in SLBs due the lack of knowledge of constraints: there exist ‘easy’ instances, for which suboptimal extreme points have large ‘gaps’, but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret or safety violations, due to an inability to resolve unknown optima to arbitrary precision. We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, \textsc{doss}, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, \textsc{doss} simultaneously obtains tight instance-dependent $O(\log^2 T)$ bounds on efficacy regret, and $\widetilde O(\sqrt{T})$ bounds on safety violations, thus attaining near Pareto-optimality. Further, when safety is demanded to a finite precision, violations improve to $O(\log^2 T).$ These results rely on a novel dual analysis of linear bandits: we argue that \textsc{doss} proceeds by activating noisy versions of at least $d$ constraints in each round, which allows us to separately analyse rounds where a ‘poor’ set of constraints is activated, and rounds where ‘good’ sets of constraints are activated. The costs in the former are controlled to $O(\log^2 T)$ by developing new dual notions of gaps, based on global sensitivity analyses of linear programs, that quantify the suboptimality of each such set of constraints. The latter costs are controlled to $O(1)$ by explicitly analysing the solutions of optimistic play.

APA


Gangrade, A., Chen, T. & Saligrama, V.. (2024). Safe Linear Bandits over Unknown Polytopes. Proceedings of Thirty Seventh Conference on Learning Theory, in Proceedings of Machine Learning Research 247:1755-1795 Available from https://proceedings.mlr.press/v247/gangrade24a.html.

Related Material

Download PDF