Bayesian Regret Minimization in Offline Bandits

Marek Petrik, Guy Tennenholtz, Mohammad Ghavamzadeh
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:40502-40522, 2024.

Abstract

We study how to make decisions that minimize Bayesian regret in offline linear bandits. Prior work suggests taking the action with the maximum lower confidence bound (LCB) on its reward. We argue that reliance on the LCB is inherently flawed in this setting and propose a new algorithm that directly minimizes upper bounds on the Bayesian regret using efficient conic optimization solvers. Our bounds build heavily on new connections to monetary risk measures. Proving a matching lower bound, we show that our upper bounds are tight and that minimizing them is guaranteed to outperform the LCB approach. Our numerical results on synthetic domains confirm that our approach is superior to maximizing the LCB.
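As a rough illustration of the contrast the abstract draws, and not the paper's actual algorithm, the sketch below fits a Gaussian posterior to synthetic offline linear-bandit data and compares the LCB rule against a direct Bayesian-regret criterion. The paper minimizes closed-form upper bounds via conic optimization; this sketch substitutes a plain Monte Carlo estimate of the Bayesian regret, and every name and constant here (kappa, lam, the action set) is an assumption made for illustration.

```python
# Illustrative sketch only: LCB rule vs. direct Bayesian regret minimization
# in an offline linear bandit. The paper minimizes conic-programming upper
# bounds on the Bayesian regret; here a Monte Carlo estimate stands in.
import numpy as np

rng = np.random.default_rng(0)

# --- Synthetic offline data for a d-dimensional linear bandit (assumed) ---
d, n, sigma = 3, 50, 0.5                       # feature dim, dataset size, noise std
actions = rng.normal(size=(8, d))              # finite candidate action set
theta_true = rng.normal(size=d)                # unknown reward parameter
X = actions[rng.integers(0, len(actions), n)]  # logged action features
y = X @ theta_true + sigma * rng.normal(size=n)

# --- Gaussian posterior over theta (standard Bayesian linear regression) ---
lam = 1.0                                      # prior precision (assumed)
cov = np.linalg.inv(X.T @ X / sigma**2 + lam * np.eye(d))
mean = cov @ X.T @ y / sigma**2

# --- Rule 1: maximize the lower confidence bound (LCB) ---
kappa = 1.0                                    # confidence multiplier (assumed)
widths = np.sqrt(np.einsum("ad,dk,ak->a", actions, cov, actions))
lcb = actions @ mean - kappa * widths
a_lcb = int(np.argmax(lcb))

# --- Rule 2: minimize a Monte Carlo estimate of the Bayesian regret ---
thetas = rng.multivariate_normal(mean, cov, size=4000)  # posterior samples
rewards = thetas @ actions.T                             # (samples, actions)
regret = (rewards.max(axis=1, keepdims=True) - rewards).mean(axis=0)
a_bayes = int(np.argmin(regret))

print(f"LCB action:          {a_lcb} (LCB value {lcb[a_lcb]:.3f})")
print(f"Bayes-regret action: {a_bayes} (est. regret {regret[a_bayes]:.3f})")
print(f"Est. Bayesian regret of the LCB action: {regret[a_lcb]:.3f}")
```

Because the LCB rule penalizes each action's uncertainty in isolation while the Bayesian criterion averages regret against the entire posterior, the two rules can disagree even on this toy problem; that gap is what the paper's upper and matching lower bounds formalize.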

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-petrik24a,
  title     = {{B}ayesian Regret Minimization in Offline Bandits},
  author    = {Petrik, Marek and Tennenholtz, Guy and Ghavamzadeh, Mohammad},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {40502--40522},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/petrik24a/petrik24a.pdf},
  url       = {https://proceedings.mlr.press/v235/petrik24a.html},
  abstract  = {We study how to make decisions that minimize Bayesian regret in offline linear bandits. Prior work suggests that one must take actions with maximum lower confidence bound (LCB) on their reward. We argue that reliance on LCB is inherently flawed in this setting and propose a new algorithm that directly minimizes upper-bounds on the Bayesian regret using efficient conic optimization solvers. Our bounds build heavily on new connections to monetary risk measures. Proving a matching lower-bound, we show that our upper-bounds are tight, and by minimizing them we are guaranteed to outperform the LCB approach. Our numerical results on synthetic domains confirm that our approach is superior to maximizing LCB.}
}
Endnote
%0 Conference Paper
%T Bayesian Regret Minimization in Offline Bandits
%A Marek Petrik
%A Guy Tennenholtz
%A Mohammad Ghavamzadeh
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-petrik24a
%I PMLR
%P 40502--40522
%U https://proceedings.mlr.press/v235/petrik24a.html
%V 235
%X We study how to make decisions that minimize Bayesian regret in offline linear bandits. Prior work suggests that one must take actions with maximum lower confidence bound (LCB) on their reward. We argue that reliance on LCB is inherently flawed in this setting and propose a new algorithm that directly minimizes upper-bounds on the Bayesian regret using efficient conic optimization solvers. Our bounds build heavily on new connections to monetary risk measures. Proving a matching lower-bound, we show that our upper-bounds are tight, and by minimizing them we are guaranteed to outperform the LCB approach. Our numerical results on synthetic domains confirm that our approach is superior to maximizing LCB.
APA
Petrik, M., Tennenholtz, G. & Ghavamzadeh, M. (2024). Bayesian Regret Minimization in Offline Bandits. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:40502-40522. Available from https://proceedings.mlr.press/v235/petrik24a.html.