A Tighter Analysis of Randomised Policy Iteration

Meet Taraviya; Shivaram Kalyanakrishnan

A Tighter Analysis of Randomised Policy Iteration

Meet Taraviya, Shivaram Kalyanakrishnan

Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, PMLR 115:519-529, 2020.

Abstract

Policy Iteration (PI) is a popular family of algorithms to compute an optimal policy for a givenMarkov Decision Problem (MDP). Starting from an arbitrary initial policy, PI repeatedly performs locally-improving switches until an optimal policy is found. The exact form of the switching rule gives rise to different variants of PI. Two decades ago, Mansour and Singh [1999] provided the first non-trivial “strong” upper bound on the number of iterations taken by “Howard’s PI” (HPI), a widelyused variant of PI (strong bounds depend only on the number of states and actions in the MDP). They also proposed a randomised variant (RPI) and showed an even tighter strong upper bound. Their bounds for HPI and RPI have not been improved subsequently.\\{We} revisit the algorithms and analysis of Mansour and Singh [1999]. We prove a novel result on the structure of the policy space for k-action MDPs, $k\geq 2$, which generalises a result known for k = 2. Also proposing a new counting argument, we obtain a strong bound of (O$(\sqrt{k \log k}))^n$ iterations for an algorithm akin to RPI, improving significantly upon Mansour and Singh’s original bound of roughly O($(k/2)^n$). Similar analysis of a randomised variant of HPI also yields a strong upper bound of (O($\sqrt{k \log k}))^n$ iterations, registering the first exponential improvement for HPI over the trivial bound of $k^n$. Our other contributions include a lower bound of $\Omega(n)$ iterations for RPI and an upper bound of $1.6001^n$ iterations for a randomised variant of “Batch-Switching PI” [Kalyanakrishnan et al., 2016a] on 2-action MDPs—the tightest strong upper bound shown yet for the PI family.

Cite this Paper

BibTeX


@InProceedings{pmlr-v115-taraviya20a,
  title = 	 {A Tighter Analysis of Randomised Policy Iteration},
  author =       {Taraviya, Meet and Kalyanakrishnan, Shivaram},
  booktitle = 	 {Proceedings of The 35th Uncertainty in Artificial Intelligence Conference},
  pages = 	 {519--529},
  year = 	 {2020},
  editor = 	 {Adams, Ryan P. and Gogate, Vibhav},
  volume = 	 {115},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {22--25 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v115/taraviya20a/taraviya20a.pdf},
  url = 	 {https://proceedings.mlr.press/v115/taraviya20a.html},
  abstract = 	 {Policy Iteration (PI) is a popular family of algorithms to compute an optimal policy for a givenMarkov Decision Problem (MDP). Starting from an arbitrary initial policy, PI repeatedly performs locally-improving switches until an optimal policy is found. The exact form of the switching rule gives rise to different variants of PI. Two decades ago, Mansour and Singh [1999] provided the first non-trivial “strong” upper bound on the number of iterations taken by “Howard’s PI” (HPI), a widelyused variant of PI (strong bounds depend only on the number of states and actions in the MDP). They also proposed a randomised variant (RPI) and showed an even tighter strong upper bound. Their bounds for HPI and RPI have not been improved subsequently.\\{We} revisit the algorithms and analysis of Mansour and Singh [1999]. We prove a novel result on the structure of the policy space for k-action MDPs, $k\geq 2$, which generalises a result known for k = 2. Also proposing a new counting argument, we obtain a strong bound of (O$(\sqrt{k \log k}))^n$ iterations for an algorithm akin to RPI, improving significantly upon Mansour and Singh’s original bound of roughly O($(k/2)^n$). Similar analysis of a randomised variant of HPI also yields a strong upper bound of (O($\sqrt{k \log k}))^n$ iterations, registering the first exponential improvement for HPI over the trivial bound of $k^n$. Our other contributions include a lower bound of  $\Omega(n)$ iterations for RPI and an upper bound of $1.6001^n$ iterations for a randomised variant of “Batch-Switching PI” [Kalyanakrishnan et al., 2016a] on 2-action MDPs—the tightest strong upper bound shown yet for the PI family.}
}

Endnote

%0 Conference Paper
%T A Tighter Analysis of Randomised Policy Iteration
%A Meet Taraviya
%A Shivaram Kalyanakrishnan
%B Proceedings of The 35th Uncertainty in Artificial Intelligence Conference
%C Proceedings of Machine Learning Research
%D 2020
%E Ryan P. Adams
%E Vibhav Gogate	
%F pmlr-v115-taraviya20a
%I PMLR
%P 519--529
%U https://proceedings.mlr.press/v115/taraviya20a.html
%V 115
%X Policy Iteration (PI) is a popular family of algorithms to compute an optimal policy for a givenMarkov Decision Problem (MDP). Starting from an arbitrary initial policy, PI repeatedly performs locally-improving switches until an optimal policy is found. The exact form of the switching rule gives rise to different variants of PI. Two decades ago, Mansour and Singh [1999] provided the first non-trivial “strong” upper bound on the number of iterations taken by “Howard’s PI” (HPI), a widelyused variant of PI (strong bounds depend only on the number of states and actions in the MDP). They also proposed a randomised variant (RPI) and showed an even tighter strong upper bound. Their bounds for HPI and RPI have not been improved subsequently.\\{We} revisit the algorithms and analysis of Mansour and Singh [1999]. We prove a novel result on the structure of the policy space for k-action MDPs, $k\geq 2$, which generalises a result known for k = 2. Also proposing a new counting argument, we obtain a strong bound of (O$(\sqrt{k \log k}))^n$ iterations for an algorithm akin to RPI, improving significantly upon Mansour and Singh’s original bound of roughly O($(k/2)^n$). Similar analysis of a randomised variant of HPI also yields a strong upper bound of (O($\sqrt{k \log k}))^n$ iterations, registering the first exponential improvement for HPI over the trivial bound of $k^n$. Our other contributions include a lower bound of  $\Omega(n)$ iterations for RPI and an upper bound of $1.6001^n$ iterations for a randomised variant of “Batch-Switching PI” [Kalyanakrishnan et al., 2016a] on 2-action MDPs—the tightest strong upper bound shown yet for the PI family.

APA


Taraviya, M. & Kalyanakrishnan, S.. (2020). A Tighter Analysis of Randomised Policy Iteration. Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, in Proceedings of Machine Learning Research 115:519-529 Available from https://proceedings.mlr.press/v115/taraviya20a.html.

A Tighter Analysis of Randomised Policy Iteration

Abstract

Cite this Paper

Related Material