Adapting to misspecification in contextual bandits with offline regression oracles

Sanath Kumar Krishnamurthy, Vitor Hadad, Susan Athey
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5805-5814, 2021.

Abstract

Computationally efficient contextual bandits are often based on estimating a predictive model of rewards given contexts and arms using past data. However, when the reward model is not well-specified, the bandit algorithm may incur unexpected regret, so recent work has focused on algorithms that are robust to misspecification. We propose a simple family of contextual bandit algorithms that adapt to misspecification error by reverting to a good safe policy when there is evidence that misspecification is causing a regret increase. Our algorithm requires only an offline regression oracle to ensure regret guarantees that gracefully degrade in terms of a measure of the average misspecification level. Compared to prior work, we attain similar regret guarantees, but we do not rely on a master algorithm, and do not require more robust oracles like online or constrained regression oracles (e.g., Foster et al. (2020), Krishnamurthy et al. (2020)). This allows us to design algorithms for more general function approximation classes.
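
The abstract describes acting on a reward model fit by an offline regression oracle while falling back to a safe policy when misspecification appears to be increasing regret. The following is only a minimal illustrative sketch of that fall-back idea, not the authors' algorithm: the oracle (`fit_oracle`), the safe arm, the epoch length, and the misspecification test and its threshold are all assumptions made for illustration.

```python
# Illustrative sketch (not the paper's algorithm): act greedily on a reward
# model fit by an offline least-squares "oracle", and revert to a safe policy
# if realized rewards fall well below the model's own predictions.
import numpy as np

rng = np.random.default_rng(0)
T, K, d = 2000, 3, 5
SAFE_ARM = 0                      # stand-in for a "good safe policy"
theta = rng.normal(size=(K, d))   # true reward parameters

def true_reward(x, a):
    # Quadratic term makes the linear model class misspecified for arm 2.
    return theta[a] @ x + 0.5 * (x[0] ** 2) * (a == 2) + rng.normal(scale=0.1)

def fit_oracle(X, A, R):
    """Offline regression oracle: least squares, one linear model per arm."""
    models = []
    for a in range(K):
        mask = A == a
        if mask.sum() >= d:
            w, *_ = np.linalg.lstsq(X[mask], R[mask], rcond=None)
        else:
            w = np.zeros(d)
        models.append(w)
    return np.stack(models)

X_hist, A_hist, R_hist = [], [], []
model = np.zeros((K, d))
use_safe = False
realized, predicted, n_greedy = 0.0, 0.0, 0

for t in range(T):
    x = rng.normal(size=d)
    if t % 100 == 0 and X_hist:   # refit the oracle at the start of each epoch
        model = fit_oracle(np.array(X_hist), np.array(A_hist), np.array(R_hist))
    a = SAFE_ARM if use_safe else int(np.argmax(model @ x))
    r = true_reward(x, a)
    X_hist.append(x); A_hist.append(a); R_hist.append(r)
    if not use_safe:
        realized += r
        predicted += model[a] @ x   # what the fitted model promised
        n_greedy += 1
        # Crude illustrative test: if the model's predictions overshoot the
        # realized rewards by a large margin, revert to the safe policy.
        if n_greedy > 200 and (predicted - realized) / n_greedy > 0.5:
            use_safe = True
```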

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-krishnamurthy21a,
  title     = {Adapting to misspecification in contextual bandits with offline regression oracles},
  author    = {Krishnamurthy, Sanath Kumar and Hadad, Vitor and Athey, Susan},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {5805--5814},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/krishnamurthy21a/krishnamurthy21a.pdf},
  url       = {https://proceedings.mlr.press/v139/krishnamurthy21a.html},
  abstract  = {Computationally efficient contextual bandits are often based on estimating a predictive model of rewards given contexts and arms using past data. However, when the reward model is not well-specified, the bandit algorithm may incur unexpected regret, so recent work has focused on algorithms that are robust to misspecification. We propose a simple family of contextual bandit algorithms that adapt to misspecification error by reverting to a good safe policy when there is evidence that misspecification is causing a regret increase. Our algorithm requires only an offline regression oracle to ensure regret guarantees that gracefully degrade in terms of a measure of the average misspecification level. Compared to prior work, we attain similar regret guarantees, but we do not rely on a master algorithm, and do not require more robust oracles like online or constrained regression oracles (e.g., Foster et al. (2020), Krishnamurthy et al. (2020)). This allows us to design algorithms for more general function approximation classes.}
}
Endnote
%0 Conference Paper
%T Adapting to misspecification in contextual bandits with offline regression oracles
%A Sanath Kumar Krishnamurthy
%A Vitor Hadad
%A Susan Athey
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-krishnamurthy21a
%I PMLR
%P 5805--5814
%U https://proceedings.mlr.press/v139/krishnamurthy21a.html
%V 139
%X Computationally efficient contextual bandits are often based on estimating a predictive model of rewards given contexts and arms using past data. However, when the reward model is not well-specified, the bandit algorithm may incur unexpected regret, so recent work has focused on algorithms that are robust to misspecification. We propose a simple family of contextual bandit algorithms that adapt to misspecification error by reverting to a good safe policy when there is evidence that misspecification is causing a regret increase. Our algorithm requires only an offline regression oracle to ensure regret guarantees that gracefully degrade in terms of a measure of the average misspecification level. Compared to prior work, we attain similar regret guarantees, but we do not rely on a master algorithm, and do not require more robust oracles like online or constrained regression oracles (e.g., Foster et al. (2020), Krishnamurthy et al. (2020)). This allows us to design algorithms for more general function approximation classes.
APA
Krishnamurthy, S.K., Hadad, V. & Athey, S. (2021). Adapting to misspecification in contextual bandits with offline regression oracles. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:5805-5814. Available from https://proceedings.mlr.press/v139/krishnamurthy21a.html.
