Thompson Sampling for Linearly Constrained Bandits

Vidit Saxena, Joakim Jalden, Joseph Gonzalez
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:1999-2009, 2020.

Abstract

We address multi-armed bandits (MAB) where the objective is to maximize the cumulative reward under a probabilistic linear constraint. For a few real-world instances of this problem, constrained extensions of the well-known Thompson Sampling (TS) heuristic have recently been proposed. However, finite-time analysis of constrained TS is challenging; as a result, only O(√T) bounds on the cumulative reward loss (i.e., the regret) are available. In this paper, we describe LinConTS, a TS-based algorithm for bandits that place a linear constraint on the probability of earning a reward in every round. We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by O(log T). We develop a proof technique that relies on careful analysis of the dual problem and combine it with recent theoretical work on unconstrained TS. Through numerical experiments on two real-world datasets, we demonstrate that LinConTS outperforms an asymptotically optimal upper confidence bound (UCB) scheme in terms of simultaneously minimizing the regret and the violation.
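The abstract only sketches the per-round mechanics, so the following Python sketch illustrates one way the general pattern it describes could look: per-arm Beta posteriors over the probability of earning a reward, a fresh posterior sample each round, and a randomized arm choice obtained from a small linear program that enforces the sampled constraint. This is not the authors' implementation; the known per-arm reward values (values), the constraint threshold (eta), the LP formulation, and the infeasibility fallback are all illustrative assumptions.

# Hedged sketch of a constrained-Thompson-Sampling round (illustrative only,
# not the paper's reference code). Arms have unknown reward probabilities,
# assumed-known reward values, and a lower bound eta on the per-round
# probability of earning a reward.
import numpy as np
from scipy.optimize import linprog

def linconts_round(successes, failures, values, eta, rng):
    """One constrained-TS decision.

    successes, failures : per-arm Beta posterior counts (reward / no reward)
    values              : per-arm reward values (assumed known here)
    eta                 : required lower bound on the reward probability
    """
    # Posterior sample of each arm's reward probability.
    theta = rng.beta(successes + 1, failures + 1)

    # LP over a mixing distribution pi: maximize sum_k pi_k * values_k * theta_k
    # subject to sum_k pi_k * theta_k >= eta and pi on the probability simplex.
    # linprog minimizes, so the objective is negated and ">=" becomes "<=".
    n = len(theta)
    res = linprog(
        c=-(values * theta),
        A_ub=[-theta],
        b_ub=[-eta],
        A_eq=[np.ones(n)],
        b_eq=[1.0],
        bounds=[(0.0, 1.0)] * n,
    )
    if not res.success:
        # Sampled constraint infeasible: fall back to the arm with the highest
        # sampled reward probability (one reasonable heuristic, an assumption).
        return int(np.argmax(theta))
    pi = np.clip(res.x, 0.0, None)
    pi /= pi.sum()
    return int(rng.choice(n, p=pi))

# Tiny usage example: 3 arms, Bernoulli rewards, threshold eta = 0.8.
rng = np.random.default_rng(0)
true_p = np.array([0.95, 0.85, 0.60])   # unknown reward probabilities
values = np.array([1.0, 2.0, 4.0])      # assumed-known per-arm reward values
succ, fail = np.zeros(3), np.zeros(3)
for t in range(1000):
    k = linconts_round(succ, fail, values, eta=0.8, rng=rng)
    reward_earned = float(rng.random() < true_p[k])
    succ[k] += reward_earned
    fail[k] += 1.0 - reward_earned

The LP mirrors the abstract's setup in spirit: the objective is the sampled expected reward of the randomized arm choice, while the single linear constraint keeps the sampled probability of earning a reward above the threshold in every round.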

Cite this Paper


BibTeX
@InProceedings{pmlr-v108-saxena20a,
  title     = {Thompson Sampling for Linearly Constrained Bandits},
  author    = {Saxena, Vidit and Jalden, Joakim and Gonzalez, Joseph},
  booktitle = {Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics},
  pages     = {1999--2009},
  year      = {2020},
  editor    = {Chiappa, Silvia and Calandra, Roberto},
  volume    = {108},
  series    = {Proceedings of Machine Learning Research},
  month     = {26--28 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v108/saxena20a/saxena20a.pdf},
  url       = {https://proceedings.mlr.press/v108/saxena20a.html},
  abstract  = {We address multi-armed bandits (MAB) where the objective is to maximize the cumulative reward under a probabilistic linear constraint. For a few real-world instances of this problem, constrained extensions of the well-known Thompson Sampling (TS) heuristic have recently been proposed. However, finite-time analysis of constrained TS is challenging; as a result, only O( sqrt( T ) ) bounds on the cumulative reward loss (i.e., the regret) are available. In this paper, we describe LinConTS, a TS-based algorithm for bandits that place a linear constraint on the probability of earning a reward in every round. We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by O( log ( T ) ). We develop a proof technique that relies on careful analysis of the dual problem and combine it with recent theoretical work on unconstrained TS. Through numerical experiments on two real-world datasets, we demonstrate that LinConTS outperforms an asymptotically optimal upper confidence bound (UCB) scheme in terms of simultaneously minimizing the regret and the violation.}
}
Endnote
%0 Conference Paper
%T Thompson Sampling for Linearly Constrained Bandits
%A Vidit Saxena
%A Joakim Jalden
%A Joseph Gonzalez
%B Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2020
%E Silvia Chiappa
%E Roberto Calandra
%F pmlr-v108-saxena20a
%I PMLR
%P 1999--2009
%U https://proceedings.mlr.press/v108/saxena20a.html
%V 108
%X We address multi-armed bandits (MAB) where the objective is to maximize the cumulative reward under a probabilistic linear constraint. For a few real-world instances of this problem, constrained extensions of the well-known Thompson Sampling (TS) heuristic have recently been proposed. However, finite-time analysis of constrained TS is challenging; as a result, only O( sqrt( T ) ) bounds on the cumulative reward loss (i.e., the regret) are available. In this paper, we describe LinConTS, a TS-based algorithm for bandits that place a linear constraint on the probability of earning a reward in every round. We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by O( log ( T ) ). We develop a proof technique that relies on careful analysis of the dual problem and combine it with recent theoretical work on unconstrained TS. Through numerical experiments on two real-world datasets, we demonstrate that LinConTS outperforms an asymptotically optimal upper confidence bound (UCB) scheme in terms of simultaneously minimizing the regret and the violation.
APA
Saxena, V., Jalden, J. & Gonzalez, J. (2020). Thompson Sampling for Linearly Constrained Bandits. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 108:1999-2009. Available from https://proceedings.mlr.press/v108/saxena20a.html.