Online learning in bandits with predicted context

Yongyi Guo, Ziping Xu, Susan Murphy
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:2215-2223, 2024.

Abstract

We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation environments based on synthetic and real digital intervention datasets.
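
As background for the measurement error model mentioned in the abstract, the following is a minimal sketch of the classical, offline errors-in-variables correction for a one-dimensional linear model. It is not the paper's algorithm: the function name `simulate`, the Gaussian data-generating process, and all parameter values are illustrative assumptions, and the sketch ignores the online setting in which the policy depends on the noisy context. It only shows why regressing on a noisy context biases the estimate toward zero, and how knowing the error variance lets one undo that bias.

```python
import random


def simulate(n=20000, theta=2.0, noise_var=1.0, seed=0):
    """Compare naive and variance-corrected slope estimates (illustrative only)."""
    rng = random.Random(seed)
    sxy = 0.0  # running sum of x_obs * y
    sxx = 0.0  # running sum of x_obs ** 2
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                        # true context, never observed
        x_obs = x + rng.gauss(0.0, noise_var ** 0.5)   # predicted (noisy) context
        y = theta * x + rng.gauss(0.0, 0.1)            # outcome depends on the true context
        sxy += x_obs * y
        sxx += x_obs * x_obs
    # Naive least squares treats x_obs as if it were x; its limit is
    # theta * Var(x) / (Var(x) + noise_var), i.e. attenuated toward zero.
    naive = sxy / sxx
    # Classical correction: subtract the known error variance from the
    # second moment of the observed context before dividing.
    corrected = sxy / (sxx - n * noise_var)
    return naive, corrected


naive, corrected = simulate()
print(f"naive slope:     {naive:.3f}")
print(f"corrected slope: {corrected:.3f}")
```

With unit-variance context and unit error variance, as assumed here, the naive slope concentrates near half of `theta`, while the corrected slope concentrates near `theta` itself.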

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-guo24b,
  title = {Online learning in bandits with predicted context},
  author = {Guo, Yongyi and Xu, Ziping and Murphy, Susan},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages = {2215--2223},
  year = {2024},
  editor = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume = {238},
  series = {Proceedings of Machine Learning Research},
  month = {02--04 May},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v238/guo24b/guo24b.pdf},
  url = {https://proceedings.mlr.press/v238/guo24b.html},
  abstract = {We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation environments based on synthetic and real digital intervention datasets.}
}
Endnote
%0 Conference Paper
%T Online learning in bandits with predicted context
%A Yongyi Guo
%A Ziping Xu
%A Susan Murphy
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-guo24b
%I PMLR
%P 2215--2223
%U https://proceedings.mlr.press/v238/guo24b.html
%V 238
%X We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation environments based on synthetic and real digital intervention datasets.
APA
Guo, Y., Xu, Z., & Murphy, S. (2024). Online learning in bandits with predicted context. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:2215-2223. Available from https://proceedings.mlr.press/v238/guo24b.html.
