Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing

J. Jon Ryu, Jeongyeol Kwon, Benjamin Koppe, Kwang-Sung Jun
Proceedings of Thirty Eighth Conference on Learning Theory, PMLR 291:5015-5053, 2025.

Abstract

We consider off-policy selection and learning in contextual bandits, where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is two-fold. First, we propose a novel off-policy selection method that leverages a new betting-based confidence bound applied to an inverse propensity weight sequence. Our theoretical analysis reveals that this method achieves a significantly improved, variance-adaptive guarantee over prior work. Second, we propose a novel and generic condition on the optimization objective for off-policy learning that strikes a different balance between bias and variance. One special case, which we call freezing, tends to induce low variance, which is preferred in small-data regimes. Our analysis shows that it matches the best existing guarantees. In our empirical study, our selection method outperforms existing methods, and freezing exhibits improved performance in small-sample regimes.
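
As a point of reference for the setup above, the following is a minimal sketch of the standard inverse-propensity-weighted (IPW) value estimate that off-policy selection and learning build on. This is an illustration under our own assumptions, not the paper's method: the betting-based confidence bound and the freezing objective are not reproduced here, and the function and argument names (ipw_value_estimate, target_policy, etc.) are hypothetical.

import numpy as np

def ipw_value_estimate(contexts, actions, rewards, behavior_probs, target_policy):
    """Estimate the value of target_policy from logged bandit data
    (context, action, reward) collected by a fixed behavior policy.

    behavior_probs[i] is the behavior policy's probability of the logged
    action actions[i] given contexts[i]; target_policy(x, a) returns the
    target policy's probability of taking action a in context x.
    """
    # Inverse propensity weights: pi_target(a | x) / pi_behavior(a | x).
    target_probs = np.array([target_policy(x, a) for x, a in zip(contexts, actions)])
    weights = target_probs / np.asarray(behavior_probs, dtype=float)
    # The average weighted observed reward is an unbiased estimate of the
    # target policy's expected reward, despite the data being logged by
    # the behavior policy rather than the target policy.
    return float(np.mean(weights * np.asarray(rewards, dtype=float)))

Off-policy selection then amounts to comparing such value estimates, or confidence lower bounds on them as in the paper, across a set of candidate policies.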

Cite this Paper


BibTeX
@InProceedings{pmlr-v291-ryu25a,
  title     = {Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing},
  author    = {Ryu, J. Jon and Kwon, Jeongyeol and Koppe, Benjamin and Jun, Kwang-Sung},
  booktitle = {Proceedings of Thirty Eighth Conference on Learning Theory},
  pages     = {5015--5053},
  year      = {2025},
  editor    = {Haghtalab, Nika and Moitra, Ankur},
  volume    = {291},
  series    = {Proceedings of Machine Learning Research},
  month     = {30 Jun--04 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v291/main/assets/ryu25a/ryu25a.pdf},
  url       = {https://proceedings.mlr.press/v291/ryu25a.html},
  abstract  = {We consider off-policy selection and learning in contextual bandits, where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is two-fold. First, we propose a novel off-policy selection method that leverages a new betting-based confidence bound applied to an inverse propensity weight sequence. Our theoretical analysis reveals that this method achieves a significantly improved, variance-adaptive guarantee over prior work. Second, we propose a novel and generic condition on the optimization objective for off-policy learning that strikes a different balance between bias and variance. One special case, which we call freezing, tends to induce low variance, which is preferred in small-data regimes. Our analysis shows that it matches the best existing guarantees. In our empirical study, our selection method outperforms existing methods, and freezing exhibits improved performance in small-sample regimes.}
}
Endnote
%0 Conference Paper
%T Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing
%A J. Jon Ryu
%A Jeongyeol Kwon
%A Benjamin Koppe
%A Kwang-Sung Jun
%B Proceedings of Thirty Eighth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2025
%E Nika Haghtalab
%E Ankur Moitra
%F pmlr-v291-ryu25a
%I PMLR
%P 5015--5053
%U https://proceedings.mlr.press/v291/ryu25a.html
%V 291
%X We consider off-policy selection and learning in contextual bandits, where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is two-fold. First, we propose a novel off-policy selection method that leverages a new betting-based confidence bound applied to an inverse propensity weight sequence. Our theoretical analysis reveals that this method achieves a significantly improved, variance-adaptive guarantee over prior work. Second, we propose a novel and generic condition on the optimization objective for off-policy learning that strikes a different balance between bias and variance. One special case, which we call freezing, tends to induce low variance, which is preferred in small-data regimes. Our analysis shows that it matches the best existing guarantees. In our empirical study, our selection method outperforms existing methods, and freezing exhibits improved performance in small-sample regimes.
APA
Ryu, J.J., Kwon, J., Koppe, B. & Jun, K. (2025). Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing. Proceedings of Thirty Eighth Conference on Learning Theory, in Proceedings of Machine Learning Research 291:5015-5053. Available from https://proceedings.mlr.press/v291/ryu25a.html.
