SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits

Subhojyoti Mukherjee, Qiaomin Xie, Josiah P. Hanna, Robert Nowak
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:2962-2970, 2024.

Abstract

In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected reward it will obtain when executed in a multi-armed bandit environment. Ours is the first work to study such an optimal data collection strategy for policy evaluation under heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimation in the heteroscedastic linear bandit setting, assuming the noise variances are known. This design minimizes the mean squared error (MSE) of the estimated value of the target policy and is termed the oracle design. Since the noise variances are typically unknown, we then introduce a novel algorithm, SPEED (Structured Policy Evaluation Experimental Design), that tracks the oracle design, and we derive its regret with respect to the oracle design. We show that the regret scales as $\widetilde{O}(d^3 n^{-3/2})$ and prove a lower bound of $\Omega(d^2 n^{-3/2})$ that matches in its dependence on $n$. Finally, we evaluate SPEED on a set of policy evaluation tasks and demonstrate that it achieves MSE comparable to that of the optimal oracle design and much lower than that of simply running the target policy.
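The oracle design described above admits a short numerical illustration. The sketch below (Python/NumPy; not the authors' code) assumes a finite set of $K$ arms with features $x_a \in \mathbb{R}^d$, known noise variances $\sigma_a^2$, and a target policy $\pi$; the features, variances, and policy are made up for illustration. The policy value is estimated as $\phi^\top \hat{\theta}$ with $\phi = \sum_a \pi(a) x_a$, its variance under a sampling design $w$ is $\phi^\top A(w)^{-1} \phi$ where $A(w) = \sum_a w_a x_a x_a^\top / \sigma_a^2$, and the oracle design minimizes this quantity over the probability simplex.

# Minimal sketch of the oracle design idea (illustrative assumptions throughout).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, K = 3, 5                          # feature dimension, number of arms (assumed)
X = rng.normal(size=(K, d))          # arm feature vectors x_a in R^d (assumed)
sigma2 = rng.uniform(0.5, 4.0, K)    # known heteroscedastic noise variances (assumed)
pi = np.full(K, 1.0 / K)             # target policy (uniform, for illustration)
phi = pi @ X                         # policy-value feature: sum_a pi(a) x_a

def value_variance(w, ridge=1e-6):
    """Variance proxy phi^T A(w)^{-1} phi of the WLS value estimate,
    where A(w) = sum_a w_a x_a x_a^T / sigma_a^2 (small ridge for invertibility)."""
    A = (X.T * (w / sigma2)) @ X + ridge * np.eye(d)
    return phi @ np.linalg.solve(A, phi)

# Minimize over the probability simplex; the objective is convex in w.
cons = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]
res = minimize(value_variance, np.full(K, 1.0 / K),
               bounds=[(0.0, 1.0)] * K, constraints=cons, method="SLSQP")
w_star = res.x

print("oracle design (sampling proportions):", np.round(w_star, 3))
print("design variance:", value_variance(w_star))
print("on-policy variance:", value_variance(pi))  # baseline: just run the target policy

Because $w \mapsto \phi^\top A(w)^{-1} \phi$ is convex, a generic constrained solver recovers the oracle proportions when the variances are known; SPEED addresses the practical case where the $\sigma_a^2$ are unknown and must be estimated while sampling.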

Cite this Paper

BibTeX
@InProceedings{pmlr-v238-mukherjee24a,
  title     = {{SPEED}: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits},
  author    = {Mukherjee, Subhojyoti and Xie, Qiaomin and Hanna, Josiah P. and Nowak, Robert},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages     = {2962--2970},
  year      = {2024},
  editor    = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume    = {238},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--04 May},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v238/mukherjee24a/mukherjee24a.pdf},
  url       = {https://proceedings.mlr.press/v238/mukherjee24a.html}
}
APA
Mukherjee, S., Xie, Q., Hanna, J. P., & Nowak, R. (2024). SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:2962-2970. Available from https://proceedings.mlr.press/v238/mukherjee24a.html.
