Top K Ranking for Multi-Armed Bandit with Noisy Evaluations

Evrard Garcelon, Vashist Avadhanula, Alessandro Lazaric, Matteo Pirotta
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:6242-6269, 2022.

Abstract

We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy, independent, and possibly biased evaluations of the true reward of each arm and selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds. Under the assumption that at each round the true reward of each arm is drawn from a fixed distribution, we derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated. First, we show a $\widetilde{O}(T^{2/3})$ regret in the general case when the observation functions are generalized linear functions of the true rewards. On the other hand, we show that an improved $\widetilde{O}(\sqrt{T})$ regret can be derived when the observation functions are noisy linear functions of the true rewards. Finally, we report an empirical validation that confirms our theoretical findings, provides a thorough comparison to alternative approaches, and further supports the interest of this setting in practice.
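The Python sketch below simulates the interaction protocol described in the abstract under simple assumptions (Gaussian true rewards, per-arm noisy linear evaluation functions) and runs a naive baseline that selects the $K$ arms with the highest evaluations. All names and parameter values here are illustrative; this is not the algorithm analyzed in the paper, only a minimal rendering of the setting.

# Minimal sketch of the noisy-evaluation top-K bandit protocol (illustrative only).
# Assumptions not taken from the paper: Gaussian reward noise, a fixed linear
# observation function per arm with an additive bias, and a naive selection rule.
import numpy as np

rng = np.random.default_rng(0)

n_arms, K, T = 10, 3, 1000
true_means = rng.uniform(0.0, 1.0, size=n_arms)   # each arm's fixed reward distribution (mean)
eval_slope = rng.uniform(0.5, 1.5, size=n_arms)   # per-arm linear observation function
eval_bias  = rng.uniform(-0.2, 0.2, size=n_arms)  # evaluations may be biased
eval_noise = 0.1

best_k_value = np.sort(true_means)[-K:].sum()     # oracle value of the top-K arms (by mean)
regret = 0.0

for t in range(T):
    # True rewards for this round, drawn from each arm's fixed distribution.
    rewards = true_means + 0.1 * rng.standard_normal(n_arms)
    # Noisy, independent, possibly biased evaluations revealed before selection.
    evals = eval_slope * rewards + eval_bias + eval_noise * rng.standard_normal(n_arms)
    # Naive strategy: trust the evaluations and pick the K highest ones.
    chosen = np.argsort(evals)[-K:]
    regret += best_k_value - true_means[chosen].sum()

print(f"cumulative pseudo-regret of the naive top-K-by-evaluation rule: {regret:.1f}")

Because the biases and slopes can reorder the arms, simply trusting the evaluations can incur regret growing linearly with $T$; closing that gap by accounting for how the evaluations are generated is the problem the paper's algorithms and regret bounds address.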

Cite this Paper


BibTeX
@InProceedings{pmlr-v151-garcelon22b,
  title     = {Top K Ranking for Multi-Armed Bandit with Noisy Evaluations},
  author    = {Garcelon, Evrard and Avadhanula, Vashist and Lazaric, Alessandro and Pirotta, Matteo},
  booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
  pages     = {6242--6269},
  year      = {2022},
  editor    = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
  volume    = {151},
  series    = {Proceedings of Machine Learning Research},
  month     = {28--30 Mar},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v151/garcelon22b/garcelon22b.pdf},
  url       = {https://proceedings.mlr.press/v151/garcelon22b.html},
  abstract  = {We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy, independent, and possibly biased evaluations of the true reward of each arm and selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds. Under the assumption that at each round the true reward of each arm is drawn from a fixed distribution, we derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated. First, we show a $\widetilde{O}(T^{2/3})$ regret in the general case when the observation functions are generalized linear functions of the true rewards. On the other hand, we show that an improved $\widetilde{O}(\sqrt{T})$ regret can be derived when the observation functions are noisy linear functions of the true rewards. Finally, we report an empirical validation that confirms our theoretical findings, provides a thorough comparison to alternative approaches, and further supports the interest of this setting in practice.}
}
Endnote
%0 Conference Paper
%T Top K Ranking for Multi-Armed Bandit with Noisy Evaluations
%A Evrard Garcelon
%A Vashist Avadhanula
%A Alessandro Lazaric
%A Matteo Pirotta
%B Proceedings of The 25th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2022
%E Gustau Camps-Valls
%E Francisco J. R. Ruiz
%E Isabel Valera
%F pmlr-v151-garcelon22b
%I PMLR
%P 6242--6269
%U https://proceedings.mlr.press/v151/garcelon22b.html
%V 151
%X We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy, independent, and possibly biased evaluations of the true reward of each arm and selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds. Under the assumption that at each round the true reward of each arm is drawn from a fixed distribution, we derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated. First, we show a $\widetilde{O}(T^{2/3})$ regret in the general case when the observation functions are generalized linear functions of the true rewards. On the other hand, we show that an improved $\widetilde{O}(\sqrt{T})$ regret can be derived when the observation functions are noisy linear functions of the true rewards. Finally, we report an empirical validation that confirms our theoretical findings, provides a thorough comparison to alternative approaches, and further supports the interest of this setting in practice.
APA
Garcelon, E., Avadhanula, V., Lazaric, A. & Pirotta, M. (2022). Top K Ranking for Multi-Armed Bandit with Noisy Evaluations. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 151:6242-6269. Available from https://proceedings.mlr.press/v151/garcelon22b.html.