A Statistical Analysis of Polyak-Ruppert Averaged Q-Learning

Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, Michael I. Jordan
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:2207-2261, 2023.

Abstract

We study Q-learning with Polyak-Ruppert averaging (a.k.a., averaged Q-learning) in a discounted Markov decision process in the synchronous, tabular setting. Under a Lipschitz condition, we establish a functional central limit theorem (FCLT) for the averaged iterate $\bar{\mathbf{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The FCLT implies a fully online inference method for reinforcement learning. Furthermore, we show that $\bar{\mathbf{Q}}_T$ is a regular asymptotically linear (RAL) estimator for the optimal Q-value function $\mathbf{Q}^*$ with the most efficient influence function. We present a nonasymptotic analysis of the $\ell_{\infty}$ error, $\mathbb{E}\|\bar{\mathbf{Q}}_T-\mathbf{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results are provided for entropy-regularized Q-learning without the Lipschitz condition.
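To make the object of study concrete, here is a minimal sketch of synchronous tabular Q-learning with Polyak-Ruppert averaging under a polynomial step size $\eta_t = t^{-\alpha}$, the schedule for which the $\ell_{\infty}$ bound above is stated. The function name `averaged_q_learning`, the array shapes, and the choice `alpha=0.6` are illustrative assumptions, not code from the paper.

```python
import numpy as np

def averaged_q_learning(P, R, gamma, T, alpha=0.6, seed=0):
    """Synchronous tabular Q-learning with Polyak-Ruppert averaging (sketch).

    P: transition tensor of shape (S, A, S); R: reward matrix of shape (S, A).
    In the synchronous setting, every (s, a) pair receives one sampled next
    state per step. Returns the averaged iterate Q_bar and the last iterate Q.
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))
    Q_bar = np.zeros((S, A))
    for t in range(1, T + 1):
        eta = t ** (-alpha)  # polynomial step size
        # Sample one next state for every (s, a) pair.
        s_next = np.array([[rng.choice(S, p=P[s, a]) for a in range(A)]
                           for s in range(S)])
        # Empirical Bellman optimality backup on the sampled next states.
        target = R + gamma * Q.max(axis=1)[s_next]
        Q += eta * (target - Q)
        # Running Polyak-Ruppert average: Q_bar_t = (1/t) * sum of Q_1..Q_t.
        Q_bar += (Q - Q_bar) / t
    return Q_bar, Q
```

The running average costs only O(1) extra memory per state-action entry and needs no stored past iterates, which is consistent with the abstract's claim that the FCLT yields a fully online inference method.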

Cite this Paper


BibTeX
@InProceedings{pmlr-v206-li23b,
  title     = {A Statistical Analysis of Polyak-Ruppert Averaged Q-Learning},
  author    = {Li, Xiang and Yang, Wenhao and Liang, Jiadong and Zhang, Zhihua and Jordan, Michael I.},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {2207--2261},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/li23b/li23b.pdf},
  url       = {https://proceedings.mlr.press/v206/li23b.html},
  abstract  = {We study Q-learning with Polyak-Ruppert averaging (a.k.a., averaged Q-learning) in a discounted Markov decision process in the synchronous, tabular setting. Under a Lipschitz condition, we establish a functional central limit theorem (FCLT) for the averaged iterate $\bar{\mathbf{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The FCLT implies a fully online inference method for reinforcement learning. Furthermore, we show that $\bar{\mathbf{Q}}_T$ is a regular asymptotically linear (RAL) estimator for the optimal Q-value function $\mathbf{Q}^*$ with the most efficient influence function. We present a nonasymptotic analysis of the $\ell_{\infty}$ error, $\mathbb{E}\|\bar{\mathbf{Q}}_T-\mathbf{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results are provided for entropy-regularized Q-learning without the Lipschitz condition.}
}
Endnote
%0 Conference Paper
%T A Statistical Analysis of Polyak-Ruppert Averaged Q-Learning
%A Xiang Li
%A Wenhao Yang
%A Jiadong Liang
%A Zhihua Zhang
%A Michael I. Jordan
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-li23b
%I PMLR
%P 2207--2261
%U https://proceedings.mlr.press/v206/li23b.html
%V 206
%X We study Q-learning with Polyak-Ruppert averaging (a.k.a., averaged Q-learning) in a discounted Markov decision process in the synchronous, tabular setting. Under a Lipschitz condition, we establish a functional central limit theorem (FCLT) for the averaged iterate $\bar{\mathbf{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The FCLT implies a fully online inference method for reinforcement learning. Furthermore, we show that $\bar{\mathbf{Q}}_T$ is a regular asymptotically linear (RAL) estimator for the optimal Q-value function $\mathbf{Q}^*$ with the most efficient influence function. We present a nonasymptotic analysis of the $\ell_{\infty}$ error, $\mathbb{E}\|\bar{\mathbf{Q}}_T-\mathbf{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results are provided for entropy-regularized Q-learning without the Lipschitz condition.
APA
Li, X., Yang, W., Liang, J., Zhang, Z. & Jordan, M. I. (2023). A Statistical Analysis of Polyak-Ruppert Averaged Q-Learning. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:2207-2261. Available from https://proceedings.mlr.press/v206/li23b.html.