Off-Policy Reinforcement Learning with Delayed Rewards

Beining Han, Zhizhou Ren, Zuofan Wu, Yuan Zhou, Jian Peng
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:8280-8303, 2022.

Abstract

We study deep reinforcement learning (RL) algorithms with delayed rewards. In many real-world tasks, instant rewards are often not readily accessible or even defined immediately after the agent performs actions. In this work, we first formally define the environment with delayed rewards and discuss the challenges raised due to the non-Markovian nature of such environments. Then, we introduce a general off-policy RL framework with a new Q-function formulation that can handle the delayed rewards with theoretical convergence guarantees. For practical tasks with high dimensional state spaces, we further introduce the HC-decomposition rule of the Q-function in our framework which naturally leads to an approximation scheme that helps boost the training efficiency and stability. We finally conduct extensive experiments to demonstrate the superior performance of our algorithms over the existing work and their variants.
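The delayed-reward setting described above can be illustrated with a small sketch (this is not the paper's algorithm, only the problem setup): a wrapper that takes an ordinary environment with instant rewards and withholds them, releasing each segment's accumulated reward only at the segment boundary. The names `DelayedRewardWrapper` and `delay` are illustrative assumptions, not from the paper.

```python
class DelayedRewardWrapper:
    """Turn an environment with instant rewards into one with delayed,
    bundled rewards: the instant rewards of each segment of length `delay`
    are summed and released only when the segment ends (or the episode
    terminates), making the observed reward signal non-Markovian with
    respect to individual steps."""

    def __init__(self, env, delay=5):
        self.env = env
        self.delay = delay
        self._accum = 0.0   # rewards withheld so far in the current segment
        self._t = 0         # steps taken in the current episode

    def reset(self):
        self._accum = 0.0
        self._t = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._accum += reward
        self._t += 1
        if done or self._t % self.delay == 0:
            # segment boundary: release the accumulated reward
            delayed_reward, self._accum = self._accum, 0.0
        else:
            delayed_reward = 0.0  # reward withheld until the segment ends
        return obs, delayed_reward, done, info
```

With `delay=3`, an agent receiving an instant reward of 1.0 per step would observe the sequence 0, 0, 3, 0, 0, 3, ... — the credit-assignment difficulty the abstract refers to.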

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-han22e,
  title     = {Off-Policy Reinforcement Learning with Delayed Rewards},
  author    = {Han, Beining and Ren, Zhizhou and Wu, Zuofan and Zhou, Yuan and Peng, Jian},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {8280--8303},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/han22e/han22e.pdf},
  url       = {https://proceedings.mlr.press/v162/han22e.html},
  abstract  = {We study deep reinforcement learning (RL) algorithms with delayed rewards. In many real-world tasks, instant rewards are often not readily accessible or even defined immediately after the agent performs actions. In this work, we first formally define the environment with delayed rewards and discuss the challenges raised due to the non-Markovian nature of such environments. Then, we introduce a general off-policy RL framework with a new Q-function formulation that can handle the delayed rewards with theoretical convergence guarantees. For practical tasks with high dimensional state spaces, we further introduce the HC-decomposition rule of the Q-function in our framework which naturally leads to an approximation scheme that helps boost the training efficiency and stability. We finally conduct extensive experiments to demonstrate the superior performance of our algorithms over the existing work and their variants.}
}
Endnote
%0 Conference Paper
%T Off-Policy Reinforcement Learning with Delayed Rewards
%A Beining Han
%A Zhizhou Ren
%A Zuofan Wu
%A Yuan Zhou
%A Jian Peng
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-han22e
%I PMLR
%P 8280--8303
%U https://proceedings.mlr.press/v162/han22e.html
%V 162
%X We study deep reinforcement learning (RL) algorithms with delayed rewards. In many real-world tasks, instant rewards are often not readily accessible or even defined immediately after the agent performs actions. In this work, we first formally define the environment with delayed rewards and discuss the challenges raised due to the non-Markovian nature of such environments. Then, we introduce a general off-policy RL framework with a new Q-function formulation that can handle the delayed rewards with theoretical convergence guarantees. For practical tasks with high dimensional state spaces, we further introduce the HC-decomposition rule of the Q-function in our framework which naturally leads to an approximation scheme that helps boost the training efficiency and stability. We finally conduct extensive experiments to demonstrate the superior performance of our algorithms over the existing work and their variants.
APA
Han, B., Ren, Z., Wu, Z., Zhou, Y., & Peng, J. (2022). Off-Policy Reinforcement Learning with Delayed Rewards. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:8280-8303. Available from https://proceedings.mlr.press/v162/han22e.html.