Stabilizing Q-learning with Linear Architectures for Provable Efficient Learning

Andrea Zanette, Martin Wainwright
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:25920-25954, 2022.

Abstract

The Q-learning algorithm is a simple, fundamental, and practically very effective reinforcement learning algorithm. However, the basic protocol can exhibit unstable behavior even when implemented with simple linear function approximation. While tools such as target networks and experience replay are often used to stabilize the learning process, the individual contribution of each of these mechanisms is not well understood theoretically. This work proposes an exploration variant of the basic Q-learning protocol with linear function approximation. Our modular analysis illustrates the role played by each algorithmic tool that we adopt: a second-order update rule, a set of target networks, and a mechanism akin to experience replay. Together, they enable state-of-the-art regret bounds on linear MDPs while preserving the most prominent feature of the algorithm, namely a space complexity independent of the number of steps elapsed. Furthermore, we show that the performance of the algorithm degrades very gracefully under a new, more permissive notion of approximation error. Finally, the algorithm partially inherits problem-dependent regret bounds, expressed as a function of the 'effective' feature dimension.
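
To make the ingredients above concrete, here is a minimal, hypothetical Python sketch, not the paper's algorithm: the exploration bonus and the replay-like mechanism are omitted, and names such as phi, basic_update, and stabilized_update are illustrative. It contrasts the first-order Q-learning update with a stabilized variant that bootstraps from a frozen target parameter vector and takes a second-order, covariance-preconditioned step.

```python
# Minimal, hypothetical sketch contrasting the two update rules; the random
# feature map and all names are illustrative placeholders, not the paper's method.
import numpy as np

d = 8                      # feature dimension (illustrative)
gamma = 0.99               # discount factor
rng = np.random.default_rng(0)

def phi(s, a):
    """Placeholder feature map for a state-action pair."""
    local = np.random.default_rng(abs(hash((int(s), int(a)))) % (2**32))
    return local.normal(size=d) / np.sqrt(d)

# 1) Basic Q-learning with linear function approximation: a first-order
#    semi-gradient step toward a bootstrapped target (can be unstable).
def basic_update(theta, s, a, r, s_next, actions, step_size=0.1):
    target = r + gamma * max(phi(s_next, b) @ theta for b in actions)
    td_error = target - phi(s, a) @ theta
    return theta + step_size * td_error * phi(s, a)

# 2) Stabilized sketch: the bootstrapped value uses a separate, periodically
#    refreshed "target" parameter vector theta_bar, and the step is
#    preconditioned by the (ridge-regularized) inverse feature covariance.
def stabilized_update(theta, Sigma, theta_bar, s, a, r, s_next, actions):
    x = phi(s, a)
    Sigma = Sigma + np.outer(x, x)               # running feature covariance
    target = r + gamma * max(phi(s_next, b) @ theta_bar for b in actions)
    td_error = target - x @ theta
    theta = theta + np.linalg.solve(Sigma, x) * td_error
    return theta, Sigma

# Tiny synthetic run on random transitions, just to exercise both updates.
actions = [0, 1]
theta_basic = np.zeros(d)
theta, Sigma, theta_bar = np.zeros(d), np.eye(d), np.zeros(d)
for t in range(200):
    s, a = rng.integers(10), rng.choice(actions)
    r, s_next = rng.normal(), rng.integers(10)
    theta_basic = basic_update(theta_basic, s, a, r, s_next, actions)
    theta, Sigma = stabilized_update(theta, Sigma, theta_bar, s, a, r, s_next, actions)
    if t % 50 == 0:
        theta_bar = theta.copy()                 # refresh the target network
print("theta after 200 sketch updates:", np.round(theta, 3))
```

In this sketch the frozen theta_bar plays the role of a target network, and preconditioning by the inverse feature covariance stands in for the second-order update rule mentioned in the abstract; the replay-like mechanism and exploration bonuses of the paper are not modeled.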

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-zanette22a,
  title     = {Stabilizing Q-learning with Linear Architectures for Provable Efficient Learning},
  author    = {Zanette, Andrea and Wainwright, Martin},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {25920--25954},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/zanette22a/zanette22a.pdf},
  url       = {https://proceedings.mlr.press/v162/zanette22a.html},
  abstract  = {The Q-learning algorithm is a simple, fundamental and practically very effective reinforcement learning algorithm. However, the basic protocol can exhibit an unstable behavior when implemented even with simple linear function approximation. While tools like target networks and experience replay are often implemented to stabilize the learning process, the individual contribution of each of these mechanisms is not well understood theoretically. This work proposes an exploration variant of the basic Q-learning protocol with linear function approximation. Our modular analysis illustrates the role played by each algorithmic tool that we adopt: a second order update rule, a set of target networks, and a mechanism akin to experience replay. Together, they enable state of the art regret bounds on linear MDPs while preserving the most prominent feature of the algorithm, namely a space complexity independent of the number of steps elapsed. Furthermore, we show that the performance of the algorithm degrades very gracefully under a new, more permissive notion of approximation error. Finally, the algorithm partially inherits problem dependent regret bounds, function of the number of ‘effective’ feature dimension.}
}
APA
Zanette, A. & Wainwright, M. (2022). Stabilizing Q-learning with Linear Architectures for Provable Efficient Learning. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:25920-25954. Available from https://proceedings.mlr.press/v162/zanette22a.html.
