Reinforcement Learning in Parametric MDPs with Exponential Families

Sayak Ray Chowdhury; Aditya Gopalan; Odalric-Ambrym Maillard

Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:1855-1863, 2021.

Abstract

Extending model-based regret minimization strategies for Markov decision processes (MDPs) beyond discrete state-action spaces requires structural assumptions on the reward and transition models. Existing parametric approaches establish regret guarantees by making strong assumptions about either the state transition distribution or the value function as a function of state-action features, and often do not satisfactorily capture classical problems like linear dynamical systems or factored MDPs. This paper introduces a new MDP transition model defined by a collection of linearly parameterized exponential families with $d$ unknown parameters. For finite-horizon episodic RL with horizon $H$ in this MDP model, we propose a model-based upper confidence RL algorithm (Exp-UCRL) that solves a penalized maximum likelihood estimation problem to learn the $d$-dimensional representation of the transition distribution, balancing the exploitation-exploration tradeoff using confidence sets in the exponential family space. We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound that is of order $\tilde O(d\sqrt{H^3 N})$, sub-linear in total time $N$, linear in dimension $d$, and polynomial in the planning horizon $H$. This is achieved by deriving a novel concentration inequality for conditional exponential families that might be of independent interest. The exponential family MDP model also admits an efficient posterior sampling-style algorithm for which a similar guarantee on the Bayesian regret is shown.
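
To make the scaling of the stated bound concrete, here is a short sketch of why a bound of this order is sub-linear in $N$. It treats $d$ and $H$ as fixed and writes $R(N)$ for the regret after $N$ total steps; the symbol $R(N)$ is an assumption introduced here for illustration and does not appear in the abstract.

```latex
R(N) = \tilde O\!\left(d\sqrt{H^3 N}\right)
\quad\Longrightarrow\quad
\frac{R(N)}{N} = \tilde O\!\left(d\sqrt{\frac{H^3}{N}}\right) \to 0
\quad \text{as } N \to \infty .
```

Read this way, the average per-step regret vanishes as $N$ grows. Ignoring the logarithmic factors hidden by $\tilde O$, quadrupling $N$ doubles the bound, and doubling $d$ doubles it as well.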

Cite this Paper

BibTeX


@InProceedings{pmlr-v130-chowdhury21b,
  title = 	 {Reinforcement Learning in Parametric MDPs with Exponential Families},
  author =       {Chowdhury, Sayak Ray and Gopalan, Aditya and Maillard, Odalric-Ambrym},
  booktitle = 	 {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {1855--1863},
  year = 	 {2021},
  editor = 	 {Banerjee, Arindam and Fukumizu, Kenji},
  volume = 	 {130},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--15 Apr},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v130/chowdhury21b/chowdhury21b.pdf},
  url = 	 {https://proceedings.mlr.press/v130/chowdhury21b.html},
  abstract = 	 { Extending model-based regret minimization strategies for Markov decision processes (MDPs) beyond discrete state-action spaces requires structural assumptions on the reward and transition models. Existing parametric approaches establish regret guarantees by making strong assumptions about either the state transition distribution or the value function as a function of state-action features, and often do not satisfactorily capture classical problems like linear dynamical systems or factored MDPs. This paper introduces a new MDP transition model defined by a collection of linearly parameterized exponential families with $d$ unknown parameters. For finite-horizon episodic RL with horizon $H$ in this MDP model, we propose a model-based upper confidence RL algorithm (Exp-UCRL) that solves a penalized maximum likelihood estimation problem to learn the $d$-dimensional representation of the transition distribution, balancing the exploitation-exploration tradeoff using confidence sets in the exponential family space. We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound that is of order $\tilde O(d\sqrt{H^3 N})$, sub-linear in total time $N$, linear in dimension $d$, and polynomial in the planning horizon $H$. This is achieved by deriving a novel concentration inequality for conditional exponential families that might be of independent interest. The exponential family MDP model also admits an efficient posterior sampling-style algorithm for which a similar guarantee on the Bayesian regret is shown. }
}

Endnote

%0 Conference Paper
%T Reinforcement Learning in Parametric MDPs with Exponential Families
%A Sayak Ray Chowdhury
%A Aditya Gopalan
%A Odalric-Ambrym Maillard
%B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2021
%E Arindam Banerjee
%E Kenji Fukumizu
%F pmlr-v130-chowdhury21b
%I PMLR
%P 1855--1863
%U https://proceedings.mlr.press/v130/chowdhury21b.html
%V 130
%X  Extending model-based regret minimization strategies for Markov decision processes (MDPs) beyond discrete state-action spaces requires structural assumptions on the reward and transition models. Existing parametric approaches establish regret guarantees by making strong assumptions about either the state transition distribution or the value function as a function of state-action features, and often do not satisfactorily capture classical problems like linear dynamical systems or factored MDPs. This paper introduces a new MDP transition model defined by a collection of linearly parameterized exponential families with $d$ unknown parameters. For finite-horizon episodic RL with horizon $H$ in this MDP model, we propose a model-based upper confidence RL algorithm (Exp-UCRL) that solves a penalized maximum likelihood estimation problem to learn the $d$-dimensional representation of the transition distribution, balancing the exploitation-exploration tradeoff using confidence sets in the exponential family space. We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound that is of order $\tilde O(d\sqrt{H^3 N})$, sub-linear in total time $N$, linear in dimension $d$, and polynomial in the planning horizon $H$. This is achieved by deriving a novel concentration inequality for conditional exponential families that might be of independent interest. The exponential family MDP model also admits an efficient posterior sampling-style algorithm for which a similar guarantee on the Bayesian regret is shown.

APA


Chowdhury, S.R., Gopalan, A. & Maillard, O. (2021). Reinforcement Learning in Parametric MDPs with Exponential Families. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:1855-1863. Available from https://proceedings.mlr.press/v130/chowdhury21b.html.
