Thompson Sampling for Contextual Bandits with Linear Payoffs

Shipra Agrawal; Navin Goyal

Thompson Sampling for Contextual Bandits with Linear Payoffs

Shipra Agrawal, Navin Goyal

Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):127-135, 2013.

Abstract

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state of the art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied version of the contextual bandits problem. We prove a high probability regret bound of \tildeO(\fracd\sqrtε\sqrtT^1+ε) in time T for any ε∈(0,1), where d is the dimension of each context vector and εis a parameter used by the algorithm. Our results provide the first theoretical guarantees for the contextual version of Thompson Sampling, and are close to the lower bound of Ω(\sqrtdT) for this problem. This essentially solves the COLT open problem of Chapelle and Li [COLT 2012] regarding regret bounds for Thompson Sampling for contextual bandits problem with linear payoff functions. Our version of Thompson sampling uses Gaussian prior and Gaussian likelihood function. Our novel martingale-based analysis techniques also allow easy extensions to the use of other distributions, satisfying certain general conditions.

Cite this Paper

BibTeX


@InProceedings{pmlr-v28-agrawal13,
  title = 	 {Thompson Sampling for Contextual Bandits with Linear Payoffs},
  author = 	 {Agrawal, Shipra and Goyal, Navin},
  booktitle = 	 {Proceedings of the 30th International Conference on Machine Learning},
  pages = 	 {127--135},
  year = 	 {2013},
  editor = 	 {Dasgupta, Sanjoy and McAllester, David},
  volume = 	 {28},
  number =       {3},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Atlanta, Georgia, USA},
  month = 	 {17--19 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v28/agrawal13.pdf},
  url = 	 {https://proceedings.mlr.press/v28/agrawal13.html},
  abstract = 	 {Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state of the art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied version of the contextual bandits problem. We prove a high probability regret bound of \tildeO(\fracd\sqrtε\sqrtT^1+ε) in time T for any ε∈(0,1), where d is the dimension of each context vector and εis a parameter used by the algorithm. Our results provide the first theoretical guarantees for the contextual version of Thompson Sampling, and are close to the lower bound of Ω(\sqrtdT) for this problem. This essentially solves the COLT open problem of Chapelle and Li [COLT 2012] regarding regret bounds for Thompson Sampling for contextual bandits problem with linear payoff functions.     Our version of Thompson sampling uses Gaussian prior and Gaussian likelihood function. Our novel martingale-based analysis techniques also allow easy extensions to the use of other distributions, satisfying certain general conditions.   }
}

Endnote

%0 Conference Paper
%T Thompson Sampling for Contextual Bandits with Linear Payoffs
%A Shipra Agrawal
%A Navin Goyal
%B Proceedings of the 30th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Sanjoy Dasgupta
%E David McAllester	
%F pmlr-v28-agrawal13
%I PMLR
%P 127--135
%U https://proceedings.mlr.press/v28/agrawal13.html
%V 28
%N 3
%X Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state of the art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied version of the contextual bandits problem. We prove a high probability regret bound of \tildeO(\fracd\sqrtε\sqrtT^1+ε) in time T for any ε∈(0,1), where d is the dimension of each context vector and εis a parameter used by the algorithm. Our results provide the first theoretical guarantees for the contextual version of Thompson Sampling, and are close to the lower bound of Ω(\sqrtdT) for this problem. This essentially solves the COLT open problem of Chapelle and Li [COLT 2012] regarding regret bounds for Thompson Sampling for contextual bandits problem with linear payoff functions.     Our version of Thompson sampling uses Gaussian prior and Gaussian likelihood function. Our novel martingale-based analysis techniques also allow easy extensions to the use of other distributions, satisfying certain general conditions.

RIS


TY  - CPAPER
TI  - Thompson Sampling for Contextual Bandits with Linear Payoffs
AU  - Shipra Agrawal
AU  - Navin Goyal
BT  - Proceedings of the 30th International Conference on Machine Learning
DA  - 2013/05/26
ED  - Sanjoy Dasgupta
ED  - David McAllester	
ID  - pmlr-v28-agrawal13
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 28
IS  - 3
SP  - 127
EP  - 135
L1  - http://proceedings.mlr.press/v28/agrawal13.pdf
UR  - https://proceedings.mlr.press/v28/agrawal13.html
AB  - Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state of the art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied version of the contextual bandits problem. We prove a high probability regret bound of \tildeO(\fracd\sqrtε\sqrtT^1+ε) in time T for any ε∈(0,1), where d is the dimension of each context vector and εis a parameter used by the algorithm. Our results provide the first theoretical guarantees for the contextual version of Thompson Sampling, and are close to the lower bound of Ω(\sqrtdT) for this problem. This essentially solves the COLT open problem of Chapelle and Li [COLT 2012] regarding regret bounds for Thompson Sampling for contextual bandits problem with linear payoff functions.     Our version of Thompson sampling uses Gaussian prior and Gaussian likelihood function. Our novel martingale-based analysis techniques also allow easy extensions to the use of other distributions, satisfying certain general conditions.   
ER  -

APA


Agrawal, S. & Goyal, N.. (2013). Thompson Sampling for Contextual Bandits with Linear Payoffs. Proceedings of the 30th International Conference on Machine Learning, in Proceedings of Machine Learning Research 28(3):127-135 Available from https://proceedings.mlr.press/v28/agrawal13.html.

Thompson Sampling for Contextual Bandits with Linear Payoffs

Abstract

Cite this Paper

Related Material