A Statistical Analysis of Polyak-Ruppert Averaged Q-Learning

Xiang Li; Wenhao Yang; Jiadong Liang; Zhihua Zhang; Michael I. Jordan

A Statistical Analysis of Polyak-Ruppert Averaged Q-Learning

Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, Michael I. Jordan

Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:2207-2261, 2023.

Abstract

We study Q-learning with Polyak-Ruppert averaging (a.k.a., averaged Q-learning) in a discounted markov decision process in synchronous and tabular settings. Under a Lipschitz condition, we establish a functional central limit theorem for the averaged iteration $\bar{\mathbf{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The FCLT implies a fully online inference method for reinforcement learning. Furthermore, we show that $\bar{\mathbf{Q}}_T$ is the regular asymptotically linear (RAL) estimator for the optimal Q-value function $\mathbf{Q}^*$ that has the most efficient influence function. We present a nonasymptotic analysis for the $\ell_{\infty}$ error, $\mathbb{E}\|\bar{\mathbf{Q}}_T-\mathbf{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results are provided for entropy-regularized Q-Learning without the Lipschitz condition.

Cite this Paper

BibTeX


@InProceedings{pmlr-v206-li23b,
  title = 	 {A Statistical Analysis of Polyak-Ruppert Averaged Q-Learning},
  author =       {Li, Xiang and Yang, Wenhao and Liang, Jiadong and Zhang, Zhihua and Jordan, Michael I.},
  booktitle = 	 {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {2207--2261},
  year = 	 {2023},
  editor = 	 {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume = 	 {206},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--27 Apr},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v206/li23b/li23b.pdf},
  url = 	 {https://proceedings.mlr.press/v206/li23b.html},
  abstract = 	 {We study Q-learning with Polyak-Ruppert averaging (a.k.a., averaged Q-learning) in a discounted markov decision process in synchronous and tabular settings. Under a Lipschitz condition, we establish a functional central limit theorem for the averaged iteration $\bar{\mathbf{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The FCLT implies a fully online inference method for reinforcement learning. Furthermore, we show that $\bar{\mathbf{Q}}_T$ is the regular asymptotically linear (RAL) estimator for the optimal Q-value function $\mathbf{Q}^*$ that has the most efficient influence function. We present a nonasymptotic analysis for the $\ell_{\infty}$ error, $\mathbb{E}\|\bar{\mathbf{Q}}_T-\mathbf{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results are provided for entropy-regularized Q-Learning without the Lipschitz condition.}
}

Endnote

%0 Conference Paper
%T A Statistical Analysis of Polyak-Ruppert Averaged Q-Learning
%A Xiang Li
%A Wenhao Yang
%A Jiadong Liang
%A Zhihua Zhang
%A Michael I. Jordan
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent	
%F pmlr-v206-li23b
%I PMLR
%P 2207--2261
%U https://proceedings.mlr.press/v206/li23b.html
%V 206
%X We study Q-learning with Polyak-Ruppert averaging (a.k.a., averaged Q-learning) in a discounted markov decision process in synchronous and tabular settings. Under a Lipschitz condition, we establish a functional central limit theorem for the averaged iteration $\bar{\mathbf{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The FCLT implies a fully online inference method for reinforcement learning. Furthermore, we show that $\bar{\mathbf{Q}}_T$ is the regular asymptotically linear (RAL) estimator for the optimal Q-value function $\mathbf{Q}^*$ that has the most efficient influence function. We present a nonasymptotic analysis for the $\ell_{\infty}$ error, $\mathbb{E}\|\bar{\mathbf{Q}}_T-\mathbf{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results are provided for entropy-regularized Q-Learning without the Lipschitz condition.

APA


Li, X., Yang, W., Liang, J., Zhang, Z. & Jordan, M.I.. (2023). A Statistical Analysis of Polyak-Ruppert Averaged Q-Learning. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:2207-2261 Available from https://proceedings.mlr.press/v206/li23b.html.

Related Material

Download PDF