Learning to Plan Variable Length Sequences of Actions with a Cascading Bandit Click Model of User Feedback

Anirban Santara, Gaurav Aggarwal, Shuai Li, Claudio Gentile
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:767-797, 2022.

Abstract

Motivated by problems of ranking with partial information, we introduce a variant of the cascading bandit model that considers flexible length sequences with varying rewards and losses. We formulate two generative models for this problem within the generalized linear setting, and design and analyze upper confidence algorithms for it. Our analysis delivers tight regret bounds which, when specialized to standard cascading bandits, result in sharper guarantees than previously available in the literature. We evaluate our algorithms against a representative sample of cascading bandit baselines on a number of real-world datasets and show significantly improved empirical performance.

Cite this Paper


BibTeX
@InProceedings{pmlr-v151-santara22a,
  title = {Learning to Plan Variable Length Sequences of Actions with a Cascading Bandit Click Model of User Feedback},
  author = {Santara, Anirban and Aggarwal, Gaurav and Li, Shuai and Gentile, Claudio},
  booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
  pages = {767--797},
  year = {2022},
  editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
  volume = {151},
  series = {Proceedings of Machine Learning Research},
  month = {28--30 Mar},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v151/santara22a/santara22a.pdf},
  url = {https://proceedings.mlr.press/v151/santara22a.html},
  abstract = {Motivated by problems of ranking with partial information, we introduce a variant of the cascading bandit model that considers flexible length sequences with varying rewards and losses. We formulate two generative models for this problem within the generalized linear setting, and design and analyze upper confidence algorithms for it. Our analysis delivers tight regret bounds which, when specialized to standard cascading bandits, result in sharper guarantees than previously available in the literature. We evaluate our algorithms against a representative sample of cascading bandit baselines on a number of real-world datasets and show significantly improved empirical performance.}
}
Endnote
%0 Conference Paper
%T Learning to Plan Variable Length Sequences of Actions with a Cascading Bandit Click Model of User Feedback
%A Anirban Santara
%A Gaurav Aggarwal
%A Shuai Li
%A Claudio Gentile
%B Proceedings of The 25th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2022
%E Gustau Camps-Valls
%E Francisco J. R. Ruiz
%E Isabel Valera
%F pmlr-v151-santara22a
%I PMLR
%P 767--797
%U https://proceedings.mlr.press/v151/santara22a.html
%V 151
%X Motivated by problems of ranking with partial information, we introduce a variant of the cascading bandit model that considers flexible length sequences with varying rewards and losses. We formulate two generative models for this problem within the generalized linear setting, and design and analyze upper confidence algorithms for it. Our analysis delivers tight regret bounds which, when specialized to standard cascading bandits, result in sharper guarantees than previously available in the literature. We evaluate our algorithms against a representative sample of cascading bandit baselines on a number of real-world datasets and show significantly improved empirical performance.
APA
Santara, A., Aggarwal, G., Li, S. & Gentile, C. (2022). Learning to Plan Variable Length Sequences of Actions with a Cascading Bandit Click Model of User Feedback. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 151:767-797. Available from https://proceedings.mlr.press/v151/santara22a.html.