Linear Thompson Sampling Revisited

Marc Abeille, Alessandro Lazaric
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54:176-184, 2017.

Abstract

We derive an alternative proof for the regret of Thompson sampling (TS) in the stochastic linear bandit setting. While we obtain a regret bound of order $O(d^{3/2}\sqrt{T})$ as in previous results, the proof sheds new light on the functioning of TS. We leverage the structure of the problem to show how the regret is related to the sensitivity (i.e., the gradient) of the objective function and how selecting optimal arms associated with optimistic parameters does control it. Thus we show that TS can be seen as a generic randomized algorithm where the sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach. Furthermore, we show that our proof can be readily applied to regularized linear optimization and generalized linear model problems.
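The algorithm analyzed in the abstract can be illustrated with a minimal sketch: maintain a regularized least-squares estimate, draw a perturbed parameter from a Gaussian centered at that estimate (the sampling distribution that is "optimistic with fixed probability"), and act greedily with respect to the sample. This is an illustrative implementation, not the authors' code; the arm set, prior scale `v`, and noise level are hypothetical choices.

```python
import numpy as np

def linear_ts_regret(arms, theta_star, T, v=1.0, lam=1.0, noise=0.1, seed=0):
    """Sketch of linear Thompson sampling on a fixed finite arm set.

    arms:       (K, d) array of arm feature vectors
    theta_star: (d,) unknown true parameter (used only to simulate rewards)
    Returns the cumulative (pseudo-)regret over T rounds.
    """
    rng = np.random.default_rng(seed)
    d = arms.shape[1]
    V = lam * np.eye(d)          # regularized design matrix V_t
    b = np.zeros(d)              # sum of x_s * r_s
    means = arms @ theta_star
    opt = means.max()
    regret = 0.0
    for _ in range(T):
        theta_hat = np.linalg.solve(V, b)                 # ridge estimate
        cov = v ** 2 * np.linalg.inv(V)                   # sampling covariance
        theta_tilde = rng.multivariate_normal(theta_hat, cov)  # perturbed parameter
        i = int(np.argmax(arms @ theta_tilde))            # greedy w.r.t. the sample
        x = arms[i]
        r = x @ theta_star + noise * rng.standard_normal()
        V += np.outer(x, x)                               # rank-one design update
        b += r * x
        regret += opt - means[i]
    return regret
```

The Gaussian sampling step is what distinguishes TS from a UCB-style rule: instead of maximizing an upper confidence bound, it randomizes the parameter so that the sampled model is optimistic with constant probability, which (per the paper's analysis) costs an extra $\sqrt{d}$ factor in the regret.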

Cite this Paper


BibTeX
@InProceedings{pmlr-v54-abeille17a,
  title     = {{Linear Thompson Sampling Revisited}},
  author    = {Abeille, Marc and Lazaric, Alessandro},
  booktitle = {Proceedings of the 20th International Conference on Artificial Intelligence and Statistics},
  pages     = {176--184},
  year      = {2017},
  editor    = {Singh, Aarti and Zhu, Jerry},
  volume    = {54},
  series    = {Proceedings of Machine Learning Research},
  month     = {20--22 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v54/abeille17a/abeille17a.pdf},
  url       = {https://proceedings.mlr.press/v54/abeille17a.html},
  abstract  = {We derive an alternative proof for the regret of Thompson sampling (TS) in the stochastic linear bandit setting. While we obtain a regret bound of order $O(d^{3/2}\sqrt{T})$ as in previous results, the proof sheds new light on the functioning of TS. We leverage the structure of the problem to show how the regret is related to the sensitivity (i.e., the gradient) of the objective function and how selecting optimal arms associated with \textit{optimistic} parameters does control it. Thus we show that TS can be seen as a generic randomized algorithm where the sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach. Furthermore, we show that our proof can be readily applied to regularized linear optimization and generalized linear model problems.}
}
Endnote
%0 Conference Paper
%T Linear Thompson Sampling Revisited
%A Marc Abeille
%A Alessandro Lazaric
%B Proceedings of the 20th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2017
%E Aarti Singh
%E Jerry Zhu
%F pmlr-v54-abeille17a
%I PMLR
%P 176--184
%U https://proceedings.mlr.press/v54/abeille17a.html
%V 54
%X We derive an alternative proof for the regret of Thompson sampling (TS) in the stochastic linear bandit setting. While we obtain a regret bound of order $O(d^{3/2}\sqrt{T})$ as in previous results, the proof sheds new light on the functioning of TS. We leverage the structure of the problem to show how the regret is related to the sensitivity (i.e., the gradient) of the objective function and how selecting optimal arms associated with optimistic parameters does control it. Thus we show that TS can be seen as a generic randomized algorithm where the sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach. Furthermore, we show that our proof can be readily applied to regularized linear optimization and generalized linear model problems.
APA
Abeille, M. & Lazaric, A. (2017). Linear Thompson Sampling Revisited. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 54:176-184. Available from https://proceedings.mlr.press/v54/abeille17a.html.