Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency

Heyang Zhao, Jiafan He, Dongruo Zhou, Tong Zhang, Quanquan Gu
Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:4977-5020, 2023.

Abstract

Recently, several studies \citep{zhou2021nearly, zhang2021variance, kim2021improved, zhou2022computationally} have provided variance-dependent regret bounds for linear contextual bandits, which interpolate between the regret in the worst-case regime and that in the deterministic-reward regime. However, these algorithms are either computationally intractable or unable to handle unknown noise variance. In this paper, we present a novel solution to this open problem by proposing the \emph{first computationally efficient} algorithm for linear bandits with heteroscedastic noise. Our algorithm is adaptive to the unknown variance of the noise and achieves an $\tilde{O}(d \sqrt{\sum_{k = 1}^K \sigma_k^2} + d)$ regret, where $\sigma_k^2$ is the \emph{variance} of the noise at round $k$, $d$ is the dimension of the contexts, and $K$ is the total number of rounds. Our results are based on an adaptive variance-aware confidence set, enabled by a new Freedman-type concentration inequality for self-normalized martingales, and a multi-layer structure that stratifies the context vectors into layers with different uniform upper bounds on the uncertainty. Furthermore, our approach extends to linear mixture Markov decision processes (MDPs) in reinforcement learning. We propose a variance-adaptive algorithm for linear mixture MDPs, which achieves a problem-dependent horizon-free regret bound that gracefully reduces to a nearly constant regret for deterministic MDPs. Unlike existing nearly minimax optimal algorithms for linear mixture MDPs, our algorithm does not require explicit variance estimation of the transition probabilities or the use of high-order moment estimators to attain horizon-free regret. We believe the techniques developed in this paper have independent value for general online decision-making problems.
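
To make the weighted-regression idea behind variance-aware linear bandits concrete, the following is a minimal Python sketch of a variance-weighted ridge-regression estimator combined with an optimistic (UCB-style) action choice. It is an illustration only, not the paper's algorithm: the paper handles unknown variances through an adaptive, multi-layer variance-aware confidence construction, whereas this sketch assumes a known per-round variance sigma2 and a fixed confidence radius beta; the class name WeightedRidgeUCB and all parameter choices are hypothetical.

import numpy as np

class WeightedRidgeUCB:
    def __init__(self, d, lam=1.0, beta=1.0):
        self.A = lam * np.eye(d)   # weighted Gram matrix: lam*I + sum_k w_k x_k x_k^T
        self.b = np.zeros(d)       # weighted response vector: sum_k w_k r_k x_k
        self.beta = beta           # confidence radius (in the paper, set by the theory)

    def estimate(self):
        # Weighted ridge regression estimator: theta_hat = A^{-1} b
        return np.linalg.solve(self.A, self.b)

    def select(self, contexts):
        # Optimistic choice: maximize <theta_hat, x> + beta * ||x||_{A^{-1}}
        theta = self.estimate()
        A_inv = np.linalg.inv(self.A)
        bonus = np.sqrt(np.einsum("ki,ij,kj->k", contexts, A_inv, contexts))
        return int(np.argmax(contexts @ theta + self.beta * bonus))

    def update(self, x, reward, sigma2):
        # Down-weight noisy rounds: w_k proportional to 1 / max(sigma_k^2, floor).
        # With unknown variance, the paper instead builds adaptive variance-aware
        # weights via its multi-layer stratification; this fixed rule is a stand-in.
        w = 1.0 / max(sigma2, 1e-2)
        self.A += w * np.outer(x, x)
        self.b += w * reward * x
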

Cite this Paper


BibTeX
@InProceedings{pmlr-v195-zhao23a,
  title     = {Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency},
  author    = {Zhao, Heyang and He, Jiafan and Zhou, Dongruo and Zhang, Tong and Gu, Quanquan},
  booktitle = {Proceedings of Thirty Sixth Conference on Learning Theory},
  pages     = {4977--5020},
  year      = {2023},
  editor    = {Neu, Gergely and Rosasco, Lorenzo},
  volume    = {195},
  series    = {Proceedings of Machine Learning Research},
  month     = {12--15 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v195/zhao23a/zhao23a.pdf},
  url       = {https://proceedings.mlr.press/v195/zhao23a.html}
}
APA
Zhao, H., He, J., Zhou, D., Zhang, T. & Gu, Q. (2023). Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency. Proceedings of Thirty Sixth Conference on Learning Theory, in Proceedings of Machine Learning Research 195:4977-5020. Available from https://proceedings.mlr.press/v195/zhao23a.html.
