Thompson Sampling for Complex Online Problems

Aditya Gopalan; Shie Mannor; Yishay Mansour

Thompson Sampling for Complex Online Problems

Aditya Gopalan, Shie Mannor, Yishay Mansour

Proceedings of the 31st International Conference on Machine Learning, PMLR 32(1):100-108, 2014.

Abstract

We consider stochastic multi-armed bandit problems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a basic arm in each round. The reward of the complex action is some function of the basic arms’ rewards, and the feedback observed may not necessarily be the reward per-arm. For instance, when the complex actions are subsets of the arms, we may only observe the maximum reward over the chosen subset. Thus, feedback across complex actions may be coupled due to the nature of the reward function. We prove a frequentist regret bound for Thompson sampling in a very general setting involving parameter, action and observation spaces and a likelihood function over them. The bound holds for discretely-supported priors over the parameter space and without additional structural properties such as closed-form posteriors, conjugate prior structure or independence across arms. The regret bound scales logarithmically with time but, more importantly, with an improved constant that non-trivially captures the coupling across complex actions due to the structure of the rewards. As applications, we derive improved regret bounds for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear MAX reward feedback from subsets. Using particle filters for computing posterior distributions which lack an explicit closed-form, we present numerical results for the performance of Thompson sampling for subset-selection and job scheduling problems.

Cite this Paper

BibTeX


@InProceedings{pmlr-v32-gopalan14,
  title = 	 {Thompson Sampling for Complex Online Problems},
  author = 	 {Gopalan, Aditya and Mannor, Shie and Mansour, Yishay},
  booktitle = 	 {Proceedings of the 31st International Conference on Machine Learning},
  pages = 	 {100--108},
  year = 	 {2014},
  editor = 	 {Xing, Eric P. and Jebara, Tony},
  volume = 	 {32},
  number =       {1},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Bejing, China},
  month = 	 {22--24 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v32/gopalan14.pdf},
  url = 	 {https://proceedings.mlr.press/v32/gopalan14.html},
  abstract = 	 {We consider stochastic multi-armed bandit problems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a basic arm in each round. The reward of the complex action is some function of the basic arms’ rewards, and the feedback observed may not necessarily be the reward per-arm. For instance, when the complex actions are subsets of the arms, we may only observe the maximum reward over the chosen subset. Thus, feedback across complex actions may be coupled due to the nature of the reward function. We prove a frequentist regret bound for Thompson sampling in a very general setting involving parameter, action and observation spaces and a likelihood function over them. The bound holds for discretely-supported priors over the parameter space and without additional structural properties such as closed-form posteriors, conjugate prior structure or independence across arms. The regret bound scales logarithmically with time but, more importantly, with an improved constant that non-trivially captures the coupling across complex actions due to the structure of the rewards. As applications, we derive improved regret bounds for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear MAX reward feedback from subsets. Using particle filters for computing posterior distributions which lack an explicit closed-form, we present numerical results for the performance of Thompson sampling for subset-selection and job scheduling problems.}
}

Endnote

%0 Conference Paper
%T Thompson Sampling for Complex Online Problems
%A Aditya Gopalan
%A Shie Mannor
%A Yishay Mansour
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara	
%F pmlr-v32-gopalan14
%I PMLR
%P 100--108
%U https://proceedings.mlr.press/v32/gopalan14.html
%V 32
%N 1
%X We consider stochastic multi-armed bandit problems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a basic arm in each round. The reward of the complex action is some function of the basic arms’ rewards, and the feedback observed may not necessarily be the reward per-arm. For instance, when the complex actions are subsets of the arms, we may only observe the maximum reward over the chosen subset. Thus, feedback across complex actions may be coupled due to the nature of the reward function. We prove a frequentist regret bound for Thompson sampling in a very general setting involving parameter, action and observation spaces and a likelihood function over them. The bound holds for discretely-supported priors over the parameter space and without additional structural properties such as closed-form posteriors, conjugate prior structure or independence across arms. The regret bound scales logarithmically with time but, more importantly, with an improved constant that non-trivially captures the coupling across complex actions due to the structure of the rewards. As applications, we derive improved regret bounds for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear MAX reward feedback from subsets. Using particle filters for computing posterior distributions which lack an explicit closed-form, we present numerical results for the performance of Thompson sampling for subset-selection and job scheduling problems.

RIS


TY  - CPAPER
TI  - Thompson Sampling for Complex Online Problems
AU  - Aditya Gopalan
AU  - Shie Mannor
AU  - Yishay Mansour
BT  - Proceedings of the 31st International Conference on Machine Learning
DA  - 2014/01/27
ED  - Eric P. Xing
ED  - Tony Jebara	
ID  - pmlr-v32-gopalan14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 32
IS  - 1
SP  - 100
EP  - 108
L1  - http://proceedings.mlr.press/v32/gopalan14.pdf
UR  - https://proceedings.mlr.press/v32/gopalan14.html
AB  - We consider stochastic multi-armed bandit problems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a basic arm in each round. The reward of the complex action is some function of the basic arms’ rewards, and the feedback observed may not necessarily be the reward per-arm. For instance, when the complex actions are subsets of the arms, we may only observe the maximum reward over the chosen subset. Thus, feedback across complex actions may be coupled due to the nature of the reward function. We prove a frequentist regret bound for Thompson sampling in a very general setting involving parameter, action and observation spaces and a likelihood function over them. The bound holds for discretely-supported priors over the parameter space and without additional structural properties such as closed-form posteriors, conjugate prior structure or independence across arms. The regret bound scales logarithmically with time but, more importantly, with an improved constant that non-trivially captures the coupling across complex actions due to the structure of the rewards. As applications, we derive improved regret bounds for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear MAX reward feedback from subsets. Using particle filters for computing posterior distributions which lack an explicit closed-form, we present numerical results for the performance of Thompson sampling for subset-selection and job scheduling problems.
ER  -

APA


Gopalan, A., Mannor, S. & Mansour, Y.. (2014). Thompson Sampling for Complex Online Problems. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(1):100-108 Available from https://proceedings.mlr.press/v32/gopalan14.html.

Thompson Sampling for Complex Online Problems

Abstract

Cite this Paper

Related Material