Thompson Sampling for the MNL-Bandit

Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, Assaf Zeevi
Proceedings of the 2017 Conference on Learning Theory, PMLR 65:76-78, 2017.

Abstract

We consider a sequential subset selection problem under parameter uncertainty, where at each time step, the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms), and observes a (bandit) feedback in the form of the index of one of the items in said subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative rewards over a finite horizon $T$, or alternatively, minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective, and arise in several important application domains. We present an approach to adapt Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance.

Cite this Paper


BibTeX
@InProceedings{pmlr-v65-agrawal17a, title = {Thompson Sampling for the MNL-Bandit}, author = {Agrawal, Shipra and Avadhanula, Vashist and Goyal, Vineet and Zeevi, Assaf}, booktitle = {Proceedings of the 2017 Conference on Learning Theory}, pages = {76--78}, year = {2017}, editor = {Kale, Satyen and Shamir, Ohad}, volume = {65}, series = {Proceedings of Machine Learning Research}, month = {07--10 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v65/agrawal17a/agrawal17a.pdf}, url = {https://proceedings.mlr.press/v65/agrawal17a.html}, abstract = {We consider a sequential subset selection problem under parameter uncertainty, where at each time step, the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms), and observes a (bandit) feedback in the form of the index of one of the items in said subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative rewards over a finite horizon $T$, or alternatively, minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective, and arise in several important application domains. We present an approach to adapt Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance. } }
Endnote
%0 Conference Paper %T Thompson Sampling for the MNL-Bandit %A Shipra Agrawal %A Vashist Avadhanula %A Vineet Goyal %A Assaf Zeevi %B Proceedings of the 2017 Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2017 %E Satyen Kale %E Ohad Shamir %F pmlr-v65-agrawal17a %I PMLR %P 76--78 %U https://proceedings.mlr.press/v65/agrawal17a.html %V 65 %X We consider a sequential subset selection problem under parameter uncertainty, where at each time step, the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms), and observes a (bandit) feedback in the form of the index of one of the items in said subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative rewards over a finite horizon $T$, or alternatively, minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective, and arise in several important application domains. We present an approach to adapt Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance.
APA
Agrawal, S., Avadhanula, V., Goyal, V. & Zeevi, A.. (2017). Thompson Sampling for the MNL-Bandit. Proceedings of the 2017 Conference on Learning Theory, in Proceedings of Machine Learning Research 65:76-78 Available from https://proceedings.mlr.press/v65/agrawal17a.html.

Related Material