On-Policy Deep Reinforcement Learning for the Average-Reward Criterion

Yiming Zhang, Keith W Ross
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:12535-12545, 2021.

Abstract

We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL). We first consider bounding the difference of the long-term average reward for two policies. We show that previous work based on the discounted return (Schulman et al. 2015, Achiam et al. 2017) results in a non-meaningful lower bound in the average reward setting. By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the policies and on Kemeny’s constant. Based on this bound, we develop an iterative procedure which produces a sequence of monotonically improved policies for the average reward criterion. This iterative procedure can then be combined with classic Deep Reinforcement Learning (DRL) methods, resulting in practical DRL algorithms that target the long-run average reward criterion. In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.
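The bound in the abstract depends on Kemeny’s constant, a standard quantity for an irreducible Markov chain: the expected number of steps to reach a state sampled from the stationary distribution, which turns out not to depend on the starting state. As a minimal sketch (not code from the paper; conventions for the constant differ by an additive 1 across references), it can be computed from the fundamental matrix Z = (I − P + 1πᵀ)⁻¹ as trace(Z) − 1:

```python
import numpy as np

def kemeny_constant(P):
    """Kemeny's constant of an irreducible Markov chain with transition matrix P.

    Computed via the fundamental matrix Z = (I - P + 1 pi^T)^{-1},
    giving K = trace(Z) - 1 (equivalently, sum_{i>=2} 1 / (1 - lambda_i)
    over the non-unit eigenvalues of P).
    """
    n = P.shape[0]
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    pi = pi / pi.sum()
    Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), pi))
    return float(np.trace(Z)) - 1.0

# Two-state chain: K = 1 / (a + b) for P = [[1-a, a], [b, 1-b]].
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])
print(kemeny_constant(P))  # approximately 1.0 for this chain
```

Intuitively, a small Kemeny’s constant means the chain mixes quickly, which is the regime in which the paper’s policy-improvement bound is tight.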

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-zhang21q,
  title     = {On-Policy Deep Reinforcement Learning for the Average-Reward Criterion},
  author    = {Zhang, Yiming and Ross, Keith W},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {12535--12545},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/zhang21q/zhang21q.pdf},
  url       = {https://proceedings.mlr.press/v139/zhang21q.html},
  abstract  = {We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL). We first consider bounding the difference of the long-term average reward for two policies. We show that previous work based on the discounted return (Schulman et al. 2015, Achiam et al. 2017) results in a non-meaningful lower bound in the average reward setting. By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the policies and on Kemeny’s constant. Based on this bound, we develop an iterative procedure which produces a sequence of monotonically improved policies for the average reward criterion. This iterative procedure can then be combined with classic Deep Reinforcement Learning (DRL) methods, resulting in practical DRL algorithms that target the long-run average reward criterion. In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.}
}
Endnote
%0 Conference Paper
%T On-Policy Deep Reinforcement Learning for the Average-Reward Criterion
%A Yiming Zhang
%A Keith W Ross
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-zhang21q
%I PMLR
%P 12535--12545
%U https://proceedings.mlr.press/v139/zhang21q.html
%V 139
%X We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL). We first consider bounding the difference of the long-term average reward for two policies. We show that previous work based on the discounted return (Schulman et al. 2015, Achiam et al. 2017) results in a non-meaningful lower bound in the average reward setting. By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the policies and on Kemeny’s constant. Based on this bound, we develop an iterative procedure which produces a sequence of monotonically improved policies for the average reward criterion. This iterative procedure can then be combined with classic Deep Reinforcement Learning (DRL) methods, resulting in practical DRL algorithms that target the long-run average reward criterion. In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.
APA
Zhang, Y. & Ross, K. W. (2021). On-Policy Deep Reinforcement Learning for the Average-Reward Criterion. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:12535-12545. Available from https://proceedings.mlr.press/v139/zhang21q.html.