Multi-step Greedy Reinforcement Learning Algorithms

Manan Tomar; Yonathan Efroni; Mohammad Ghavamzadeh

Multi-step Greedy Reinforcement Learning Algorithms

Manan Tomar, Yonathan Efroni, Mohammad Ghavamzadeh

Proceedings of the 37th International Conference on Machine Learning, PMLR 119:9504-9513, 2020.

Abstract

Multi-step greedy policies have been extensively used in model-based reinforcement learning (RL), both when a model of the environment is available (e.g., in the game of Go) and when it is learned. In this paper, we explore their benefits in model-free RL, when employed using multi-step dynamic programming algorithms: $\kappa$-Policy Iteration ($\kappa$-PI) and $\kappa$-Value Iteration ($\kappa$-VI). These methods iteratively compute the next policy ($\kappa$-PI) and value function ($\kappa$-VI) by solving a surrogate decision problem with a shaped reward and a smaller discount factor. We derive model-free RL algorithms based on $\kappa$-PI and $\kappa$-VI in which the surrogate problem can be solved by any discrete or continuous action RL method, such as DQN and TRPO. We identify the importance of a hyper-parameter that controls the extent to which the surrogate problem is solved and suggest a way to set this parameter. When evaluated on a range of Atari and MuJoCo benchmark tasks, our results indicate that for the right range of $\kappa$, our algorithms outperform DQN and TRPO. This shows that our multi-step greedy algorithms are general enough to be applied over any existing RL algorithm and can significantly improve its performance.

Cite this Paper

BibTeX

@InProceedings{pmlr-v119-tomar20a,
  title = 	 {Multi-step Greedy Reinforcement Learning Algorithms},
  author =       {Tomar, Manan and Efroni, Yonathan and Ghavamzadeh, Mohammad},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {9504--9513},
  year = 	 {2020},
  editor = 	 {III, Hal Daumé and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/tomar20a/tomar20a.pdf},
  url = 	 {https://proceedings.mlr.press/v119/tomar20a.html},
  abstract = 	 {Multi-step greedy policies have been extensively used in model-based reinforcement learning (RL), both when a model of the environment is available (e.g., in the game of Go) and when it is learned. In this paper, we explore their benefits in model-free RL, when employed using multi-step dynamic programming algorithms: $\kappa$-Policy Iteration ($\kappa$-PI) and $\kappa$-Value Iteration ($\kappa$-VI). These methods iteratively compute the next policy ($\kappa$-PI) and value function ($\kappa$-VI) by solving a surrogate decision problem with a shaped reward and a smaller discount factor. We derive model-free RL algorithms based on $\kappa$-PI and $\kappa$-VI in which the surrogate problem can be solved by any discrete or continuous action RL method, such as DQN and TRPO. We identify the importance of a hyper-parameter that controls the extent to which the surrogate problem is solved and suggest a way to set this parameter. When evaluated on a range of Atari and MuJoCo benchmark tasks, our results indicate that for the right range of $\kappa$, our algorithms outperform DQN and TRPO. This shows that our multi-step greedy algorithms are general enough to be applied over any existing RL algorithm and can significantly improve its performance.}
}

Endnote

%0 Conference Paper
%T Multi-step Greedy Reinforcement Learning Algorithms
%A Manan Tomar
%A Yonathan Efroni
%A Mohammad Ghavamzadeh
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh	
%F pmlr-v119-tomar20a
%I PMLR
%P 9504--9513
%U https://proceedings.mlr.press/v119/tomar20a.html
%V 119
%X Multi-step greedy policies have been extensively used in model-based reinforcement learning (RL), both when a model of the environment is available (e.g., in the game of Go) and when it is learned. In this paper, we explore their benefits in model-free RL, when employed using multi-step dynamic programming algorithms: $\kappa$-Policy Iteration ($\kappa$-PI) and $\kappa$-Value Iteration ($\kappa$-VI). These methods iteratively compute the next policy ($\kappa$-PI) and value function ($\kappa$-VI) by solving a surrogate decision problem with a shaped reward and a smaller discount factor. We derive model-free RL algorithms based on $\kappa$-PI and $\kappa$-VI in which the surrogate problem can be solved by any discrete or continuous action RL method, such as DQN and TRPO. We identify the importance of a hyper-parameter that controls the extent to which the surrogate problem is solved and suggest a way to set this parameter. When evaluated on a range of Atari and MuJoCo benchmark tasks, our results indicate that for the right range of $\kappa$, our algorithms outperform DQN and TRPO. This shows that our multi-step greedy algorithms are general enough to be applied over any existing RL algorithm and can significantly improve its performance.

APA

Tomar, M., Efroni, Y. & Ghavamzadeh, M.. (2020). Multi-step Greedy Reinforcement Learning Algorithms. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:9504-9513 Available from https://proceedings.mlr.press/v119/tomar20a.html.

Multi-step Greedy Reinforcement Learning Algorithms

Abstract

Cite this Paper

Related Material