Exploration in Linear Bandits with Rich Action Sets and its Implications for Inference

Debangshu Banerjee, Avishek Ghosh, Sayak Ray Chowdhury, Aditya Gopalan
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:8233-8262, 2023.

Abstract

We present a non-asymptotic lower bound on the spectrum of the design matrix generated by any linear bandit algorithm with sub-linear regret when the action set has well-behaved curvature. Specifically, we show that the minimum eigenvalue of the expected design matrix grows as $\Omega(\sqrt{n})$ whenever the expected cumulative regret of the algorithm is $O(\sqrt{n})$ and the action space has a constant Hessian around the optimal arm, where $n$ is the learning horizon. This shows that such action spaces force a polynomial lower bound on the least eigenvalue, rather than the logarithmic lower bound shown by Lattimore et al. (2017) for discrete (i.e., well-separated) action spaces. Furthermore, while the latter holds only in the asymptotic regime ($n \to \infty$), our result for these “locally rich” action spaces is anytime. Additionally, under a mild technical assumption, we obtain a similar lower bound on the minimum eigenvalue holding with high probability. We apply our result to two practical scenarios: model selection and clustering in linear bandits. For model selection, we show that an epoch-based linear bandit algorithm adapts to the true model complexity at a rate exponential in the number of epochs, by virtue of our novel spectral bound. For clustering, we consider a multi-agent framework in which we show, by leveraging the spectral result, that no forced exploration is necessary: the agents can run a linear bandit algorithm and simultaneously estimate their underlying parameters, and hence incur low regret.
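The spectral bound lends itself to a quick numerical check. Below is a minimal simulation sketch (not the authors' code): it runs LinUCB on the unit circle in R^2, an action set with constant curvature at the optimum, and tracks the minimum eigenvalue of the design matrix V_n = sum_t a_t a_t^T. If the $\Omega(\sqrt{n})$ bound is in force, the ratio lambda_min(V_n)/sqrt(n) should stabilize at a positive constant. The bandit instance, noise level, bonus scale, and grid-search arm selection are all illustrative choices, not details taken from the paper.

    # Illustrative sketch: LinUCB on the unit circle, tracking lambda_min(V_n).
    # All constants below (horizon, ridge, bonus scale, noise level) are
    # arbitrary choices for demonstration, not values from the paper.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n, lam, beta = 2, 20_000, 1.0, 1.0   # dimension, horizon, ridge, bonus scale
    theta_star = np.array([1.0, 0.0])        # unknown parameter, unit norm

    V = lam * np.eye(d)                      # regularized design matrix V_t
    b = np.zeros(d)                          # sum of a_t * reward_t
    checkpoints, min_eigs = [], []

    for t in range(1, n + 1):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b                # ridge estimate of theta_star
        # UCB arm selection over the unit circle: maximize
        # <theta_hat, a> + beta * sqrt(a^T V_inv a), here by a fine grid
        # search over angles (feasible only because d = 2).
        angles = np.linspace(0, 2 * np.pi, 720, endpoint=False)
        arms = np.stack([np.cos(angles), np.sin(angles)], axis=1)
        bonus = np.sqrt(np.einsum('ij,jk,ik->i', arms, V_inv, arms))
        a = arms[np.argmax(arms @ theta_hat + beta * bonus)]
        reward = a @ theta_star + 0.1 * rng.standard_normal()
        V += np.outer(a, a)
        b += reward * a
        if t % 2000 == 0:
            checkpoints.append(t)
            min_eigs.append(np.linalg.eigvalsh(V).min())

    for t, e in zip(checkpoints, min_eigs):
        print(f"n = {t:6d}   lambda_min(V_n) = {e:8.2f}   "
              f"lambda_min / sqrt(n) = {e / np.sqrt(t):.3f}")

The unit circle is the simplest action set with a constant (nonzero) Hessian around the optimal arm, which is exactly the "locally rich" regime the abstract describes; a well-separated finite arm set run through the same loop would instead exhibit the much slower, logarithmic eigenvalue growth of Lattimore et al. (2017).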

Cite this Paper


BibTeX

@InProceedings{pmlr-v206-banerjee23b,
  title     = {Exploration in Linear Bandits with Rich Action Sets and its Implications for Inference},
  author    = {Banerjee, Debangshu and Ghosh, Avishek and Ray Chowdhury, Sayak and Gopalan, Aditya},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {8233--8262},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/banerjee23b/banerjee23b.pdf},
  url       = {https://proceedings.mlr.press/v206/banerjee23b.html}
}
Endnote

%0 Conference Paper
%T Exploration in Linear Bandits with Rich Action Sets and its Implications for Inference
%A Debangshu Banerjee
%A Avishek Ghosh
%A Sayak Ray Chowdhury
%A Aditya Gopalan
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-banerjee23b
%I PMLR
%P 8233--8262
%U https://proceedings.mlr.press/v206/banerjee23b.html
%V 206
APA
Banerjee, D., Ghosh, A., Ray Chowdhury, S., & Gopalan, A. (2023). Exploration in Linear Bandits with Rich Action Sets and its Implications for Inference. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:8233-8262. Available from https://proceedings.mlr.press/v206/banerjee23b.html.
