Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation

Zechu Li, Tao Chen, Zhang-Wei Hong, Anurag Ajay, Pulkit Agrawal
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:19440-19459, 2023.

Abstract

Reinforcement learning is time-consuming for complex tasks due to the need for large amounts of training data. Recent advances in GPU-based simulation, such as Isaac Gym, have sped up data collection by thousands of times on a commodity GPU. Most prior work has used on-policy methods such as PPO because of their simplicity and ease of scaling. Off-policy methods are more sample-efficient but harder to scale, resulting in longer wall-clock training times. This paper presents a novel Parallel Q-Learning (PQL) scheme that outperforms PPO in wall-clock time while maintaining superior sample efficiency. The driving force is the parallelization of data collection, policy-function learning, and value-function learning. Unlike prior work on distributed off-policy learning, such as Ape-X, our scheme is designed specifically for massively parallel GPU-based simulation and is optimized to run on a single workstation. In experiments, we demonstrate that Q-learning methods can scale to tens of thousands of parallel environments, and we investigate important factors affecting learning speed, including the number of parallel environments, exploration strategies, batch size, and GPU model. The code is available at https://github.com/Improbable-AI/pql.
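To make the abstract's description concrete, below is a minimal sketch of the three-way parallelization it describes: one worker collects data from a batch of vectorized environments while two learners (critic and actor) update concurrently from a shared replay buffer on a single machine. This is not the authors' implementation (see the linked repository): the environment, buffer, and update steps are toy placeholders, and the real system uses Isaac Gym, neural networks, and separate GPU processes/streams.

    import threading
    import time
    import numpy as np

    NUM_ENVS, OBS_DIM, ACT_DIM = 1024, 8, 2    # real runs use tens of thousands of envs
    COLLECT_STEPS, BUFFER_CAP = 200, 100        # toy run length and buffer capacity

    buffer = []                                 # shared replay buffer (list of batched transitions)
    buffer_lock = threading.Lock()
    done = threading.Event()

    def collector():
        """Step all toy environments in lockstep and push batched transitions."""
        obs = np.random.randn(NUM_ENVS, OBS_DIM).astype(np.float32)
        for _ in range(COLLECT_STEPS):
            act = np.random.uniform(-1.0, 1.0, size=(NUM_ENVS, ACT_DIM)).astype(np.float32)
            next_obs = obs + 0.01 * np.random.randn(NUM_ENVS, OBS_DIM).astype(np.float32)
            rew = -np.linalg.norm(next_obs, axis=1)          # placeholder reward
            with buffer_lock:
                buffer.append((obs, act, rew, next_obs))
                if len(buffer) > BUFFER_CAP:
                    buffer.pop(0)
            obs = next_obs
        done.set()

    def learner(name, step_time):
        """Placeholder critic/actor learner: samples a batch and 'updates'."""
        updates = 0
        while not done.is_set():
            with buffer_lock:
                batch = buffer[np.random.randint(len(buffer))] if buffer else None
            if batch is not None:
                # a real learner would take a gradient step here (TD error for
                # the critic, policy gradient through the critic for the actor)
                updates += 1
            time.sleep(step_time)                            # stand-in for compute
        print(f"{name} finished with {updates} updates")

    threads = [
        threading.Thread(target=collector),
        threading.Thread(target=learner, args=("critic", 0.001)),
        threading.Thread(target=learner, args=("actor", 0.002)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

The structural point is that data collection never blocks on learning and vice versa, which is what lets the wall-clock speed of a massively parallel simulator carry over to an off-policy learner.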

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-li23f,
  title     = {Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation},
  author    = {Li, Zechu and Chen, Tao and Hong, Zhang-Wei and Ajay, Anurag and Agrawal, Pulkit},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {19440--19459},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/li23f/li23f.pdf},
  url       = {https://proceedings.mlr.press/v202/li23f.html},
  abstract  = {Reinforcement learning is time-consuming for complex tasks due to the need for large amounts of training data. Recent advances in GPU-based simulation, such as Isaac Gym, have sped up data collection thousands of times on a commodity GPU. Most prior works have used on-policy methods like PPO due to their simplicity and easy-to-scale nature. Off-policy methods are more sample-efficient, but challenging to scale, resulting in a longer wall-clock training time. This paper presents a novel Parallel Q-Learning (PQL) scheme that outperforms PPO in terms of wall-clock time and maintains superior sample efficiency. The driving force lies in the parallelization of data collection, policy function learning, and value function learning. Different from prior works on distributed off-policy learning, such as Apex, our scheme is designed specifically for massively parallel GPU-based simulation and optimized to work on a single workstation. In experiments, we demonstrate the capability of scaling up Q-learning methods to tens of thousands of parallel environments and investigate important factors that can affect learning speed, including the number of parallel environments, exploration strategies, batch size, GPU models, etc. The code is available at https://github.com/Improbable-AI/pql.}
}
Endnote
%0 Conference Paper
%T Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation
%A Zechu Li
%A Tao Chen
%A Zhang-Wei Hong
%A Anurag Ajay
%A Pulkit Agrawal
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-li23f
%I PMLR
%P 19440--19459
%U https://proceedings.mlr.press/v202/li23f.html
%V 202
%X Reinforcement learning is time-consuming for complex tasks due to the need for large amounts of training data. Recent advances in GPU-based simulation, such as Isaac Gym, have sped up data collection thousands of times on a commodity GPU. Most prior works have used on-policy methods like PPO due to their simplicity and easy-to-scale nature. Off-policy methods are more sample-efficient, but challenging to scale, resulting in a longer wall-clock training time. This paper presents a novel Parallel Q-Learning (PQL) scheme that outperforms PPO in terms of wall-clock time and maintains superior sample efficiency. The driving force lies in the parallelization of data collection, policy function learning, and value function learning. Different from prior works on distributed off-policy learning, such as Apex, our scheme is designed specifically for massively parallel GPU-based simulation and optimized to work on a single workstation. In experiments, we demonstrate the capability of scaling up Q-learning methods to tens of thousands of parallel environments and investigate important factors that can affect learning speed, including the number of parallel environments, exploration strategies, batch size, GPU models, etc. The code is available at https://github.com/Improbable-AI/pql.
APA
Li, Z., Chen, T., Hong, Z., Ajay, A. & Agrawal, P. (2023). Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:19440-19459. Available from https://proceedings.mlr.press/v202/li23f.html.
