Feasible Action Search for Bandit Linear Programs via Thompson Sampling
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:18228-18256, 2025.
Abstract
We study the ‘feasible action search’ (FAS) problem for linear bandits, wherein a learner attempts to discover a feasible point for a set of linear constraints $\Phi_* a \ge 0$, without knowledge of the matrix $\Phi_* \in \mathbb{R}^{m \times d}$. A FAS learner selects a sequence of actions $a_t$, and uses observations of the form $\Phi_* a_t + \mathrm{noise}$ to either find a point with nearly optimal ‘safety margin’, or detect that the constraints are infeasible, where the safety margin of an action measures its (signed) distance from the constraint boundary. While of interest in its own right, the FAS problem also directly addresses a key deficiency in the extant theory of ‘safe linear bandits’ (SLBs), by discovering a safe initialisation for low-regret SLB methods. We propose and analyse a novel, efficient FAS learner. Our method, FAST, is based on Thompson Sampling: it applies a coupled random perturbation to an estimate of $\Phi_*$, and plays a maximin point of a game induced by this perturbed matrix. We prove that FAST stops within $\tilde{O}(d^3/(\varepsilon^2 M_*^2))$ steps and incurs $\tilde{O}(d^3/|M_*|)$ safety costs to either correctly detect infeasibility, or output a point that is at least $(1-\varepsilon) M_*$-safe, where $M_*$ is the optimal safety margin of $\Phi_*$. Further, instantiating prior SLB methods with the output of FAST yields the first SLB methods that incur $\tilde{O}(\sqrt{d^3 T/M_*^2})$ regret and $O(1)$ risk without a priori knowledge of a safe action. The main technical novelty lies in the extension of Thompson Sampling to this multiobjective setting, for which we both propose a coupled noise design and provide an analysis that avoids convexity considerations.
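The algorithmic core described above (perturb an estimate of $\Phi_*$, then play a maximin point of the induced game) can be made concrete with a short sketch. The Python snippet below is an illustrative simplification, not the paper's exact design: the `oracle` callable, the finite candidate grid `A`, the perturbation scale, and the stopping rule are all hypothetical stand-ins, and the coupled perturbation is reduced to a single shared noise vector added to every row of the estimate.

```python
import numpy as np

def fast_sketch(oracle, d, m, eps=0.1, horizon=10_000, lam=1.0, seed=0):
    """Illustrative feasible-action-search loop in the spirit of FAST.

    `oracle(a)` is assumed to return a noisy m-vector Phi_* @ a + noise.
    The perturbation scale, candidate grid, and stopping rule are all
    simplified stand-ins for the paper's actual design.
    """
    rng = np.random.default_rng(seed)
    V = lam * np.eye(d)                     # regularised design matrix
    S = np.zeros((m, d))                    # running sum of y_t a_t^T
    # hypothetical finite action set: random directions on the unit sphere
    A = rng.standard_normal((256, d))
    A /= np.linalg.norm(A, axis=1, keepdims=True)

    for _ in range(horizon):
        V_inv = np.linalg.inv(V)
        Phi_hat = S @ V_inv                 # row-wise ridge estimate of Phi_*
        # coupled perturbation: one shared noise vector, shaped by V^{-1/2},
        # added to every row (a simplification of the paper's coupled design)
        xi = np.linalg.cholesky(V_inv) @ rng.standard_normal(d)
        Phi_tilde = Phi_hat + np.tile(xi, (m, 1))
        # maximin play: choose the action whose smallest perturbed
        # constraint margin is largest
        margins = (A @ Phi_tilde.T).min(axis=1)
        a = A[np.argmax(margins)]
        y = oracle(a)                       # noisy observation of Phi_* @ a
        V += np.outer(a, a)
        S += np.outer(y, a)
        # hypothetical stopping rule: estimated margins clear a threshold
        # while the confidence width at `a` has shrunk enough
        width = np.sqrt(a @ np.linalg.solve(V, a))
        if (Phi_hat @ a).min() > eps and width < eps / 2:
            return a                        # candidate near-optimally safe point
    return None                             # undecided (infeasibility test omitted)
```

In this sketch the maximin step rewards actions whose smallest perturbed constraint margin is large, which is what drives the search toward points deep inside the feasible region; the paper's infeasibility-detection branch is omitted here for brevity.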