[edit]

# SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits

*Proceedings of The 27th International Conference on Artificial Intelligence and Statistics*, PMLR 238:2962-2970, 2024.

#### Abstract

In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a \textit{target} policy and asked to estimate the expected reward it will obtain when executed in a multi-armed bandit environment. Our work is the first work that focuses on such an optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting with the knowledge of noise variances. This design minimizes the mean squared error (MSE) of the estimated value of the target policy and is termed the oracle design. Since the noise variance is typically unknown, we then introduce a novel algorithm, SPEED (\textbf{S}tructured \textbf{P}olicy \textbf{E}valuation \textbf{E}xperimental \textbf{D}esign), that tracks the oracle design and derive its regret with respect to the oracle design. We show that regret scales as $\widetilde{O}_{}(d^3 n^{-3/2})$ and prove a matching lower bound of $\Omega(d^2 n^{-3/2})$. Finally, we evaluate SPEED on a set of policy evaluation tasks and demonstrate that it achieves MSE comparable to an optimal oracle and much lower than simply running the target policy.