A Theory of Regularized Markov Decision Processes

Matthieu Geist; Bruno Scherrer; Olivier Pietquin

A Theory of Regularized Markov Decision Processes

Matthieu Geist, Bruno Scherrer, Olivier Pietquin

Proceedings of the 36th International Conference on Machine Learning, PMLR 97:2160-2169, 2019.

Abstract

Many recent successful (deep) reinforcement learning algorithms make use of regularization, generally based on entropy or Kullback-Leibler divergence. We propose a general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: we consider a larger class of regularizers, and we consider the general modified policy iteration approach, encompassing both policy iteration and value iteration. The core building blocks of this theory are a notion of regularized Bellman operator and the Legendre-Fenchel transform, a classical tool of convex optimization. This approach allows for error propagation analyses of general algorithmic schemes of which (possibly variants of) classical algorithms such as Trust Region Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy Programming are special cases. This also draws connections to proximal convex optimization, especially to Mirror Descent.

Cite this Paper

BibTeX


@InProceedings{pmlr-v97-geist19a,
  title = 	 {A Theory of Regularized {M}arkov Decision Processes},
  author =       {Geist, Matthieu and Scherrer, Bruno and Pietquin, Olivier},
  booktitle = 	 {Proceedings of the 36th International Conference on Machine Learning},
  pages = 	 {2160--2169},
  year = 	 {2019},
  editor = 	 {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume = 	 {97},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {09--15 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v97/geist19a/geist19a.pdf},
  url = 	 {https://proceedings.mlr.press/v97/geist19a.html},
  abstract = 	 {Many recent successful (deep) reinforcement learning algorithms make use of regularization, generally based on entropy or Kullback-Leibler divergence. We propose a general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: we consider a larger class of regularizers, and we consider the general modified policy iteration approach, encompassing both policy iteration and value iteration. The core building blocks of this theory are a notion of regularized Bellman operator and the Legendre-Fenchel transform, a classical tool of convex optimization. This approach allows for error propagation analyses of general algorithmic schemes of which (possibly variants of) classical algorithms such as Trust Region Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy Programming are special cases. This also draws connections to proximal convex optimization, especially to Mirror Descent.}
}

Endnote

%0 Conference Paper
%T A Theory of Regularized Markov Decision Processes
%A Matthieu Geist
%A Bruno Scherrer
%A Olivier Pietquin
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov	
%F pmlr-v97-geist19a
%I PMLR
%P 2160--2169
%U https://proceedings.mlr.press/v97/geist19a.html
%V 97
%X Many recent successful (deep) reinforcement learning algorithms make use of regularization, generally based on entropy or Kullback-Leibler divergence. We propose a general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: we consider a larger class of regularizers, and we consider the general modified policy iteration approach, encompassing both policy iteration and value iteration. The core building blocks of this theory are a notion of regularized Bellman operator and the Legendre-Fenchel transform, a classical tool of convex optimization. This approach allows for error propagation analyses of general algorithmic schemes of which (possibly variants of) classical algorithms such as Trust Region Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy Programming are special cases. This also draws connections to proximal convex optimization, especially to Mirror Descent.

APA


Geist, M., Scherrer, B. & Pietquin, O.. (2019). A Theory of Regularized Markov Decision Processes. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:2160-2169 Available from https://proceedings.mlr.press/v97/geist19a.html.

A Theory of Regularized Markov Decision Processes

Abstract

Cite this Paper

Related Material