Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

Yue Wu; Dongruo Zhou; Quanquan Gu

Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

Yue Wu, Dongruo Zhou, Quanquan Gu

Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:3883-3913, 2022.

Abstract

We study reinforcement learning in an infinite-horizon average-reward setting with linear function approximation for linear mixture Markov decision processes (MDPs), where the transition probability function of the underlying MDP admits a linear form over a feature mapping of the current state, action, and next state. We propose a new algorithm UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation. We show that UCRL2-VTR with Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $T$ is the horizon, and $D$ is the diameter of the MDP. We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors. To the best of our knowledge, our algorithm is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting.

Cite this Paper

BibTeX


@InProceedings{pmlr-v151-wu22a,
  title = 	 { Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation },
  author =       {Wu, Yue and Zhou, Dongruo and Gu, Quanquan},
  booktitle = 	 {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {3883--3913},
  year = 	 {2022},
  editor = 	 {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
  volume = 	 {151},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {28--30 Mar},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v151/wu22a/wu22a.pdf},
  url = 	 {https://proceedings.mlr.press/v151/wu22a.html},
  abstract = 	 { We study reinforcement learning in an infinite-horizon average-reward setting with linear function approximation for linear mixture Markov decision processes (MDPs), where the transition probability function of the underlying MDP admits a linear form over a feature mapping of the current state, action, and next state. We propose a new algorithm UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation. We show that UCRL2-VTR with Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $T$ is the horizon, and $D$ is the diameter of the MDP. We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors. To the best of our knowledge, our algorithm is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting. }
}

Endnote

%0 Conference Paper
%T  Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation 
%A Yue Wu
%A Dongruo Zhou
%A Quanquan Gu
%B Proceedings of The 25th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2022
%E Gustau Camps-Valls
%E Francisco J. R. Ruiz
%E Isabel Valera	
%F pmlr-v151-wu22a
%I PMLR
%P 3883--3913
%U https://proceedings.mlr.press/v151/wu22a.html
%V 151
%X  We study reinforcement learning in an infinite-horizon average-reward setting with linear function approximation for linear mixture Markov decision processes (MDPs), where the transition probability function of the underlying MDP admits a linear form over a feature mapping of the current state, action, and next state. We propose a new algorithm UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation. We show that UCRL2-VTR with Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $T$ is the horizon, and $D$ is the diameter of the MDP. We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors. To the best of our knowledge, our algorithm is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting.

APA


Wu, Y., Zhou, D. & Gu, Q.. (2022).  Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation . Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 151:3883-3913 Available from https://proceedings.mlr.press/v151/wu22a.html.

Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

Abstract

Cite this Paper

Related Material