Stochastic Regret Minimization via Thompson Sampling

Sudipto Guha; Kamesh Munagala

Proceedings of The 27th Conference on Learning Theory, PMLR 35:317-338, 2014.

Abstract

The Thompson Sampling (TS) policy is a widely implemented algorithm for the stochastic multi-armed bandit (MAB) problem. Given a prior distribution over possible parameter settings of the underlying reward distributions of the arms, at each time instant, the policy plays an arm with probability equal to the probability that this arm has the largest mean reward conditioned on the current posterior distributions of the arms. This policy generalizes the celebrated “probability matching” heuristic, which has been widely observed experimentally in human decision making. However, despite its ubiquity, the Thompson Sampling policy is poorly understood. Our goal in this paper is to make progress towards understanding the empirical success of this policy. We proceed using the lens of approximation algorithms and problem definitions from stochastic optimization. We focus on an objective function termed stochastic regret that captures the expected number of times the policy plays an arm that is not the eventual best arm, where the expectation is over the prior distribution. Given such a definition, we show that TS is a 2-approximation to the optimal decision policy in two extreme but canonical scenarios. One such scenario is the two-armed bandit problem, which is used as a calibration point in all bandit literature. The second scenario is stochastic optimization where the outcome of a random variable is revealed in a single play to a high or low deterministic value. We show that the 2-approximation is tight in both these scenarios. We provide a uniform analysis framework that in theory is capable of proving our conjecture that the TS policy is a 2-approximation to the optimal decision policy for minimizing stochastic regret, for any prior distribution and any time horizon.
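
The policy and objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' code or analysis: it assumes Bernoulli rewards with independent uniform Beta(1, 1) priors on each arm's mean, and the function names (`thompson_sampling_run`, `estimate_stochastic_regret`) are hypothetical. Drawing one sample per arm from its posterior and playing the arm with the largest sample selects each arm with probability equal to its posterior probability of having the largest mean. Stochastic regret is then estimated by averaging, over means drawn from the prior, the number of plays of an arm that is not the best one.

```python
import random


def thompson_sampling_run(true_means, horizon, rng):
    """Run Thompson Sampling once on Bernoulli arms (an assumed reward model).

    Returns the number of plays of an arm that is not the best arm.
    """
    num_arms = len(true_means)
    successes = [0] * num_arms  # observed rewards of 1 per arm
    failures = [0] * num_arms   # observed rewards of 0 per arm
    best_arm = max(range(num_arms), key=lambda i: true_means[i])
    wrong_plays = 0
    for _ in range(horizon):
        # One posterior sample per arm; the posterior under a Beta(1, 1)
        # prior is Beta(successes + 1, failures + 1).
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(num_arms)
        ]
        # Playing the argmax of the samples is probability matching.
        arm = max(range(num_arms), key=lambda i: samples[i])
        if arm != best_arm:
            wrong_plays += 1
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return wrong_plays


def estimate_stochastic_regret(num_arms, horizon, trials, seed=0):
    """Monte Carlo estimate of stochastic regret under the assumed prior."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # The expectation is over the prior, so draw the true means from it.
        true_means = [rng.random() for _ in range(num_arms)]
        total += thompson_sampling_run(true_means, horizon, rng)
    return total / trials


if __name__ == "__main__":
    print(estimate_stochastic_regret(num_arms=2, horizon=50, trials=200))
```

The estimate only measures the TS policy itself. Comparing it against the optimal decision policy, as the 2-approximation result does, would require solving the corresponding dynamic program, which this sketch does not attempt.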

Cite this Paper

BibTeX


@InProceedings{pmlr-v35-guha14,
  title = 	 {Stochastic Regret Minimization via Thompson Sampling},
  author = 	 {Guha, Sudipto and Munagala, Kamesh},
  booktitle = 	 {Proceedings of The 27th Conference on Learning Theory},
  pages = 	 {317--338},
  year = 	 {2014},
  editor = 	 {Balcan, Maria Florina and Feldman, Vitaly and Szepesvári, Csaba},
  volume = 	 {35},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Barcelona, Spain},
  month = 	 {13--15 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v35/guha14.pdf},
  url = 	 {https://proceedings.mlr.press/v35/guha14.html},
  abstract = 	 {The Thompson Sampling (TS) policy is a widely implemented algorithm for the stochastic multi-armed bandit (MAB) problem. Given a prior distribution over possible parameter settings of the underlying reward distributions of the arms, at each time instant, the policy plays an arm with probability equal to the probability that this arm has the largest mean reward conditioned on the current posterior distributions of the arms. This policy generalizes the celebrated “probability matching” heuristic, which has been widely observed experimentally in human decision making. However, despite its ubiquity, the Thompson Sampling policy is poorly understood. Our goal in this paper is to make progress towards understanding the empirical success of this policy. We proceed using the lens of approximation algorithms and problem definitions from stochastic optimization. We focus on an objective function termed {\em stochastic regret} that captures the expected number of times the policy plays an arm that is not the eventual best arm, where the expectation is over the prior distribution. Given such a definition, we show that TS is a 2-approximation to the optimal decision policy in two extreme but canonical scenarios. One such scenario is the two-armed bandit problem, which is used as a calibration point in all bandit literature. The second scenario is stochastic optimization where the outcome of a random variable is revealed in a single play to a high or low deterministic value. We show that the 2-approximation is tight in both these scenarios. We provide a uniform analysis framework that in theory is capable of proving our conjecture that the TS policy is a 2-approximation to the optimal decision policy for minimizing stochastic regret, for any prior distribution and any time horizon.}
}

Endnote

%0 Conference Paper
%T Stochastic Regret Minimization via Thompson Sampling
%A Sudipto Guha
%A Kamesh Munagala
%B Proceedings of The 27th Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2014
%E Maria Florina Balcan
%E Vitaly Feldman
%E Csaba Szepesvári
%F pmlr-v35-guha14
%I PMLR
%P 317--338
%U https://proceedings.mlr.press/v35/guha14.html
%V 35
%X The Thompson Sampling (TS) policy is a widely implemented algorithm for the stochastic multi-armed bandit (MAB) problem. Given a prior distribution over possible parameter settings of the underlying reward distributions of the arms, at each time instant, the policy plays an arm with probability equal to the probability that this arm has the largest mean reward conditioned on the current posterior distributions of the arms. This policy generalizes the celebrated “probability matching” heuristic, which has been widely observed experimentally in human decision making. However, despite its ubiquity, the Thompson Sampling policy is poorly understood. Our goal in this paper is to make progress towards understanding the empirical success of this policy. We proceed using the lens of approximation algorithms and problem definitions from stochastic optimization. We focus on an objective function termed stochastic regret that captures the expected number of times the policy plays an arm that is not the eventual best arm, where the expectation is over the prior distribution. Given such a definition, we show that TS is a 2-approximation to the optimal decision policy in two extreme but canonical scenarios. One such scenario is the two-armed bandit problem, which is used as a calibration point in all bandit literature. The second scenario is stochastic optimization where the outcome of a random variable is revealed in a single play to a high or low deterministic value. We show that the 2-approximation is tight in both these scenarios. We provide a uniform analysis framework that in theory is capable of proving our conjecture that the TS policy is a 2-approximation to the optimal decision policy for minimizing stochastic regret, for any prior distribution and any time horizon.

RIS


TY  - CPAPER
TI  - Stochastic Regret Minimization via Thompson Sampling
AU  - Sudipto Guha
AU  - Kamesh Munagala
BT  - Proceedings of The 27th Conference on Learning Theory
DA  - 2014/05/29
ED  - Maria Florina Balcan
ED  - Vitaly Feldman
ED  - Csaba Szepesvári
ID  - pmlr-v35-guha14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 35
SP  - 317
EP  - 338
L1  - http://proceedings.mlr.press/v35/guha14.pdf
UR  - https://proceedings.mlr.press/v35/guha14.html
AB  - The Thompson Sampling (TS) policy is a widely implemented algorithm for the stochastic multi-armed bandit (MAB) problem. Given a prior distribution over possible parameter settings of the underlying reward distributions of the arms, at each time instant, the policy plays an arm with probability equal to the probability that this arm has the largest mean reward conditioned on the current posterior distributions of the arms. This policy generalizes the celebrated “probability matching” heuristic, which has been widely observed experimentally in human decision making. However, despite its ubiquity, the Thompson Sampling policy is poorly understood. Our goal in this paper is to make progress towards understanding the empirical success of this policy. We proceed using the lens of approximation algorithms and problem definitions from stochastic optimization. We focus on an objective function termed stochastic regret that captures the expected number of times the policy plays an arm that is not the eventual best arm, where the expectation is over the prior distribution. Given such a definition, we show that TS is a 2-approximation to the optimal decision policy in two extreme but canonical scenarios. One such scenario is the two-armed bandit problem, which is used as a calibration point in all bandit literature. The second scenario is stochastic optimization where the outcome of a random variable is revealed in a single play to a high or low deterministic value. We show that the 2-approximation is tight in both these scenarios. We provide a uniform analysis framework that in theory is capable of proving our conjecture that the TS policy is a 2-approximation to the optimal decision policy for minimizing stochastic regret, for any prior distribution and any time horizon.
ER  -

APA


Guha, S. & Munagala, K. (2014). Stochastic Regret Minimization via Thompson Sampling. Proceedings of The 27th Conference on Learning Theory, in Proceedings of Machine Learning Research 35:317-338. Available from https://proceedings.mlr.press/v35/guha14.html.
