Multi-armed bandits with guaranteed revenue per arm

Dorian Baudry, Nadav Merlis, Mathieu Benjamin Molina, Hugo Richard, Vianney Perchet
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:379-387, 2024.

Abstract

We consider a Multi-Armed Bandit problem with covering constraints, where the primary goal is to ensure that each arm receives a minimum expected reward while maximizing the total cumulative reward. In this setting, the optimal policy belongs to some unknown feasible set. Unlike much of the existing literature, we do not assume the presence of a safe policy or a feasibility margin, which rules out purely conservative approaches. We therefore propose and analyze an algorithm that switches between pessimism and optimism in the face of uncertainty. We prove both precise problem-dependent and problem-independent bounds, showing that, in terms of constraint violation guarantees, our algorithm achieves the best of the two approaches depending on the presence or absence of a feasibility margin. Furthermore, our results indicate that playing greedily on the constraints actually outperforms pessimism when long-term violations, rather than per-round violations, are considered.
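To make the setting concrete, the sketch below is one illustrative reading of the abstract, not the paper's actual algorithm: each arm k is given a hypothetical revenue target of lambdas[k] * t cumulative reward by round t, and the learner switches between a constraint-driven step (greedily serving the arm whose target is most in deficit) and a reward-driven step (a standard UCB index). The names guaranteed_revenue_bandit, arms, and lambdas are assumptions introduced here for illustration.

import numpy as np

def guaranteed_revenue_bandit(arms, lambdas, T):
    """Minimal sketch of a 'guaranteed revenue per arm' bandit.

    arms:    list of callables; arms[k]() draws a reward in [0, 1]
    lambdas: np.ndarray of per-arm revenue rates; arm k should collect
             roughly lambdas[k] * t cumulative reward by round t
             (a hypothetical formalization of the covering constraint)
    T:       horizon
    """
    K = len(arms)
    counts = np.zeros(K)    # number of pulls per arm
    revenue = np.zeros(K)   # cumulative reward collected per arm
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1  # initialization: pull each arm once
        else:
            deficit = lambdas * t - revenue  # slack of each revenue target
            if deficit.max() > 0:
                # constraint-driven step ("greedy on the constraints"):
                # serve the arm whose revenue target is most violated
                k = int(np.argmax(deficit))
            else:
                # reward-driven step: optimism via a UCB index
                ucb = revenue / counts + np.sqrt(2.0 * np.log(t) / counts)
                k = int(np.argmax(ucb))
        r = arms[k]()
        counts[k] += 1
        revenue[k] += r
    return revenue, counts

# Example usage with two Bernoulli arms and modest revenue targets:
rng = np.random.default_rng(0)
arms = [lambda: rng.binomial(1, 0.7), lambda: rng.binomial(1, 0.3)]
revenue, counts = guaranteed_revenue_bandit(arms, np.array([0.1, 0.1]), T=10_000)

Whether such per-arm targets are jointly satisfiable at all, and with what margin, is precisely the feasibility question the paper analyzes; this sketch simply assumes the targets are attainable.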

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-baudry24a,
  title     = {Multi-armed bandits with guaranteed revenue per arm},
  author    = {Baudry, Dorian and Merlis, Nadav and Benjamin Molina, Mathieu and Richard, Hugo and Perchet, Vianney},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages     = {379--387},
  year      = {2024},
  editor    = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume    = {238},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--04 May},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v238/baudry24a/baudry24a.pdf},
  url       = {https://proceedings.mlr.press/v238/baudry24a.html}
}
Endnote
%0 Conference Paper
%T Multi-armed bandits with guaranteed revenue per arm
%A Dorian Baudry
%A Nadav Merlis
%A Mathieu Benjamin Molina
%A Hugo Richard
%A Vianney Perchet
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-baudry24a
%I PMLR
%P 379--387
%U https://proceedings.mlr.press/v238/baudry24a.html
%V 238
APA
Baudry, D., Merlis, N., Benjamin Molina, M., Richard, H. & Perchet, V. (2024). Multi-armed bandits with guaranteed revenue per arm. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:379-387. Available from https://proceedings.mlr.press/v238/baudry24a.html.