Thompson Sampling for the MNL-Bandit

Shipra Agrawal; Vashist Avadhanula; Vineet Goyal; Assaf Zeevi

Proceedings of the 2017 Conference on Learning Theory, PMLR 65:76-78, 2017.

Abstract

We consider a sequential subset selection problem under parameter uncertainty, where at each time step, the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms), and observes a (bandit) feedback in the form of the index of one of the items in said subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative rewards over a finite horizon $T$, or alternatively, minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective, and arise in several important application domains. We present an approach to adapt Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance.
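The feedback model described in the abstract can be made concrete with a small sketch. It assumes the usual MNL parameterization, which the abstract does not spell out: each item $i$ has an unknown positive weight $v_i$, the "none" outcome has its weight normalized to 1, and an offered subset $S$ yields item $i$ with probability $v_i / (1 + \sum_{j \in S} v_j)$. The function names and the dictionaries `v` (weights) and `r` (rewards) are hypothetical and chosen only for illustration. This is not the paper's algorithm, only the choice model it learns against.

```python
import random


def mnl_choice_probs(subset, v):
    """Map each offered item (and None for 'no choice') to its MNL probability.

    Assumes the no-choice weight is normalized to 1.
    """
    denom = 1.0 + sum(v[i] for i in subset)
    probs = {i: v[i] / denom for i in subset}
    probs[None] = 1.0 / denom
    return probs


def expected_reward(subset, v, r):
    """Expected one-step reward of offering `subset` under the MNL model."""
    denom = 1.0 + sum(v[i] for i in subset)
    return sum(r[i] * v[i] for i in subset) / denom


def sample_feedback(subset, v, rng):
    """Draw one bandit feedback: the index of the chosen item, or None."""
    probs = mnl_choice_probs(subset, v)
    outcomes = list(probs)
    weights = [probs[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


if __name__ == "__main__":
    v = {0: 1.0, 1: 1.0, 2: 2.0}   # hypothetical MNL weights
    r = {0: 3.0, 1: 6.0, 2: 1.0}   # hypothetical item rewards
    offered = [0, 1]
    print(mnl_choice_probs(offered, v))
    print(expected_reward(offered, v, r))
    print(sample_feedback(offered, v, random.Random(0)))
```

Under these assumptions, the regret over horizon $T$ is the sum over time steps of the gap between the expected reward of the best subset (computed with the true weights) and that of the subset actually offered.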

Cite this Paper

BibTeX

@InProceedings{pmlr-v65-agrawal17a,
  title = 	 {Thompson Sampling for the MNL-Bandit},
  author = 	 {Agrawal, Shipra and Avadhanula, Vashist and Goyal, Vineet and Zeevi, Assaf},
  booktitle = 	 {Proceedings of the 2017 Conference on Learning Theory},
  pages = 	 {76--78},
  year = 	 {2017},
  editor = 	 {Kale, Satyen and Shamir, Ohad},
  volume = 	 {65},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {07--10 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v65/agrawal17a/agrawal17a.pdf},
  url = 	 {https://proceedings.mlr.press/v65/agrawal17a.html},
  abstract = 	 {We consider a sequential subset selection problem under parameter uncertainty, where at each time step, the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms), and observes a (bandit) feedback in the form of the index of one of the items in said subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative rewards over a finite horizon $T$, or alternatively, minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective, and arise in several important application domains. We present an approach to adapt Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance.}
}

Endnote

%0 Conference Paper
%T Thompson Sampling for the MNL-Bandit
%A Shipra Agrawal
%A Vashist Avadhanula
%A Vineet Goyal
%A Assaf Zeevi
%B Proceedings of the 2017 Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2017
%E Satyen Kale
%E Ohad Shamir
%F pmlr-v65-agrawal17a
%I PMLR
%P 76--78
%U https://proceedings.mlr.press/v65/agrawal17a.html
%V 65
%X We consider a sequential subset selection problem under parameter uncertainty, where at each time step, the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms), and observes a (bandit) feedback in the form of the index of one of the items in said subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative rewards over a finite horizon $T$, or alternatively, minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective, and arise in several important application domains. We present an approach to adapt Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance.

APA

Agrawal, S., Avadhanula, V., Goyal, V. & Zeevi, A. (2017). Thompson Sampling for the MNL-Bandit. Proceedings of the 2017 Conference on Learning Theory, in Proceedings of Machine Learning Research 65:76-78. Available from https://proceedings.mlr.press/v65/agrawal17a.html.
