CORe: Capitalizing On Rewards in Bandit Exploration

Nan Wang, Branislav Kveton, Maryam Karimzadehgan
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, PMLR 161:1968-1978, 2021.

Abstract

We propose a bandit algorithm that explores purely by randomizing its past observations. In particular, the sufficient optimism in the mean reward estimates is achieved by exploiting the variance in the past observed rewards. We name the algorithm Capitalizing On Rewards (CORe). The algorithm is general and can be easily applied to different bandit settings. The main benefit of CORe is that its exploration is fully data-dependent. It does not rely on any external noise and adapts to different problems without parameter tuning. We derive a $\tilde O(d\sqrt{n\log K})$ gap-free bound on the n-round regret of CORe in a stochastic linear bandit, where d is the number of features and K is the number of arms. Extensive empirical evaluation on multiple synthetic and real-world problems demonstrates the effectiveness of CORe.
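The core idea — exploring by randomizing the algorithm's own reward history rather than by injecting external noise — can be illustrated with a small sketch. This is not the paper's CORe algorithm: the resampling scheme below is a plain bootstrap over each arm's observed rewards, and the two pseudo-rewards (0 and 1) seeded into each history are an assumption made here so that one-point histories still produce stochastic estimates.

```python
import random
from statistics import mean

def randomized_history_bandit(arm_means, n_rounds, seed=None):
    """Toy Bernoulli bandit explored purely by randomizing past rewards.

    Sketch of the history-randomization idea, NOT the paper's exact
    CORe perturbation: each round, every arm's reward history is
    bootstrap-resampled and the arm with the highest resampled mean
    is pulled. Returns the cumulative (pseudo-)regret.
    """
    rng = random.Random(seed)
    # Hypothetical pseudo-rewards 0 and 1 keep early estimates stochastic;
    # CORe itself derives its perturbation from the observed reward variance.
    histories = [[0.0, 1.0] for _ in arm_means]
    best = max(arm_means)
    regret = 0.0
    for _ in range(n_rounds):
        # Randomize the past: resample each arm's history with replacement,
        # then act greedily on the resampled means.
        estimates = [mean(rng.choices(h, k=len(h))) for h in histories]
        arm = max(range(len(arm_means)), key=lambda i: estimates[i])
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        histories[arm].append(reward)
        regret += best - arm_means[arm]
    return regret
```

Because the randomness comes entirely from resampling observed data, the amount of exploration shrinks automatically as each arm's history grows — the data-dependent behavior the abstract highlights.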

Cite this Paper


BibTeX
@InProceedings{pmlr-v161-wang21c,
  title     = {CORe: Capitalizing On Rewards in Bandit Exploration},
  author    = {Wang, Nan and Kveton, Branislav and Karimzadehgan, Maryam},
  booktitle = {Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence},
  pages     = {1968--1978},
  year      = {2021},
  editor    = {de Campos, Cassio and Maathuis, Marloes H.},
  volume    = {161},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v161/wang21c/wang21c.pdf},
  url       = {https://proceedings.mlr.press/v161/wang21c.html},
  abstract  = {We propose a bandit algorithm that explores purely by randomizing its past observations. In particular, the sufficient optimism in the mean reward estimates is achieved by exploiting the variance in the past observed rewards. We name the algorithm Capitalizing On Rewards (CORe). The algorithm is general and can be easily applied to different bandit settings. The main benefit of CORe is that its exploration is fully data-dependent. It does not rely on any external noise and adapts to different problems without parameter tuning. We derive a $\tilde O(d\sqrt{n\log K})$ gap-free bound on the n-round regret of CORe in a stochastic linear bandit, where d is the number of features and K is the number of arms. Extensive empirical evaluation on multiple synthetic and real-world problems demonstrates the effectiveness of CORe.}
}
Endnote
%0 Conference Paper
%T CORe: Capitalizing On Rewards in Bandit Exploration
%A Nan Wang
%A Branislav Kveton
%A Maryam Karimzadehgan
%B Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2021
%E Cassio de Campos
%E Marloes H. Maathuis
%F pmlr-v161-wang21c
%I PMLR
%P 1968--1978
%U https://proceedings.mlr.press/v161/wang21c.html
%V 161
%X We propose a bandit algorithm that explores purely by randomizing its past observations. In particular, the sufficient optimism in the mean reward estimates is achieved by exploiting the variance in the past observed rewards. We name the algorithm Capitalizing On Rewards (CORe). The algorithm is general and can be easily applied to different bandit settings. The main benefit of CORe is that its exploration is fully data-dependent. It does not rely on any external noise and adapts to different problems without parameter tuning. We derive a $\tilde O(d\sqrt{n\log K})$ gap-free bound on the n-round regret of CORe in a stochastic linear bandit, where d is the number of features and K is the number of arms. Extensive empirical evaluation on multiple synthetic and real-world problems demonstrates the effectiveness of CORe.
APA
Wang, N., Kveton, B. &amp; Karimzadehgan, M. (2021). CORe: Capitalizing On Rewards in Bandit Exploration. Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 161:1968-1978. Available from https://proceedings.mlr.press/v161/wang21c.html.