A Tighter Analysis of Randomised Policy Iteration

Meet Taraviya, Shivaram Kalyanakrishnan
Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, PMLR 115:519-529, 2020.

Abstract

Policy Iteration (PI) is a popular family of algorithms to compute an optimal policy for a given Markov Decision Problem (MDP). Starting from an arbitrary initial policy, PI repeatedly performs locally-improving switches until an optimal policy is found. The exact form of the switching rule gives rise to different variants of PI. Two decades ago, Mansour and Singh [1999] provided the first non-trivial “strong” upper bound on the number of iterations taken by “Howard’s PI” (HPI), a widely-used variant of PI (strong bounds depend only on the number of states and actions in the MDP). They also proposed a randomised variant (RPI) and showed an even tighter strong upper bound. Their bounds for HPI and RPI have not been improved subsequently. We revisit the algorithms and analysis of Mansour and Singh [1999]. We prove a novel result on the structure of the policy space for $k$-action MDPs, $k \geq 2$, which generalises a result known for $k = 2$. Also proposing a new counting argument, we obtain a strong bound of $(O(\sqrt{k \log k}))^n$ iterations for an algorithm akin to RPI, improving significantly upon Mansour and Singh’s original bound of roughly $O((k/2)^n)$. A similar analysis of a randomised variant of HPI also yields a strong upper bound of $(O(\sqrt{k \log k}))^n$ iterations, registering the first exponential improvement for HPI over the trivial bound of $k^n$. Our other contributions include a lower bound of $\Omega(n)$ iterations for RPI and an upper bound of $1.6001^n$ iterations for a randomised variant of “Batch-Switching PI” [Kalyanakrishnan et al., 2016a] on 2-action MDPs, the tightest strong upper bound shown yet for the PI family.
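
As background for readers less familiar with the PI scheme summarised above, the following is a minimal sketch of a policy-iteration loop with a randomised switching rule on a discounted MDP. It is only an illustration under assumed inputs (a transition tensor P, a reward matrix R, and a discount factor gamma); the exact randomised rules analysed in the paper (RPI and the randomised variant of HPI) differ in how the improving switches are chosen.

import numpy as np

def randomised_policy_iteration(P, R, gamma, rng=None):
    """Policy iteration with a randomised switching rule (illustrative sketch).

    P     : transition probabilities, shape (n_states, n_actions, n_states)
    R     : expected rewards, shape (n_states, n_actions)
    gamma : discount factor in [0, 1)
    Returns (policy, value function, number of iterations).
    """
    if rng is None:
        rng = np.random.default_rng()
    n, k = R.shape
    pi = np.zeros(n, dtype=int)              # arbitrary initial policy
    iterations = 0
    while True:
        iterations += 1
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[np.arange(n), pi]           # (n, n) transition matrix under pi
        R_pi = R[np.arange(n), pi]           # (n,) reward vector under pi
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
        # One-step look-ahead action values under the current value function.
        Q = R + gamma * P @ V                # shape (n, k)
        improvable = Q.max(axis=1) > V + 1e-10
        if not improvable.any():
            return pi, V, iterations         # no improving switch: pi is optimal
        # Randomised improvement: each improvable state switches to a
        # uniformly random improving action (one of many possible rules).
        new_pi = pi.copy()
        for s in np.flatnonzero(improvable):
            improving_actions = np.flatnonzero(Q[s] > V[s] + 1e-10)
            new_pi[s] = rng.choice(improving_actions)
        pi = new_pi

Each iteration evaluates the current policy exactly by solving a linear system and then applies locally-improving switches at random; the strong bounds discussed in the paper concern how many such iterations can occur as a function of the number of states n and actions k alone.
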

Cite this Paper


BibTeX
@InProceedings{pmlr-v115-taraviya20a,
  title     = {A Tighter Analysis of Randomised Policy Iteration},
  author    = {Taraviya, Meet and Kalyanakrishnan, Shivaram},
  booktitle = {Proceedings of The 35th Uncertainty in Artificial Intelligence Conference},
  pages     = {519--529},
  year      = {2020},
  editor    = {Adams, Ryan P. and Gogate, Vibhav},
  volume    = {115},
  series    = {Proceedings of Machine Learning Research},
  month     = {22--25 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v115/taraviya20a/taraviya20a.pdf},
  url       = {https://proceedings.mlr.press/v115/taraviya20a.html},
  abstract  = {Policy Iteration (PI) is a popular family of algorithms to compute an optimal policy for a given Markov Decision Problem (MDP). Starting from an arbitrary initial policy, PI repeatedly performs locally-improving switches until an optimal policy is found. The exact form of the switching rule gives rise to different variants of PI. Two decades ago, Mansour and Singh [1999] provided the first non-trivial “strong” upper bound on the number of iterations taken by “Howard’s PI” (HPI), a widely-used variant of PI (strong bounds depend only on the number of states and actions in the MDP). They also proposed a randomised variant (RPI) and showed an even tighter strong upper bound. Their bounds for HPI and RPI have not been improved subsequently. We revisit the algorithms and analysis of Mansour and Singh [1999]. We prove a novel result on the structure of the policy space for $k$-action MDPs, $k \geq 2$, which generalises a result known for $k = 2$. Also proposing a new counting argument, we obtain a strong bound of $(O(\sqrt{k \log k}))^n$ iterations for an algorithm akin to RPI, improving significantly upon Mansour and Singh’s original bound of roughly $O((k/2)^n)$. A similar analysis of a randomised variant of HPI also yields a strong upper bound of $(O(\sqrt{k \log k}))^n$ iterations, registering the first exponential improvement for HPI over the trivial bound of $k^n$. Our other contributions include a lower bound of $\Omega(n)$ iterations for RPI and an upper bound of $1.6001^n$ iterations for a randomised variant of “Batch-Switching PI” [Kalyanakrishnan et al., 2016a] on 2-action MDPs, the tightest strong upper bound shown yet for the PI family.}
}
Endnote
%0 Conference Paper
%T A Tighter Analysis of Randomised Policy Iteration
%A Meet Taraviya
%A Shivaram Kalyanakrishnan
%B Proceedings of The 35th Uncertainty in Artificial Intelligence Conference
%C Proceedings of Machine Learning Research
%D 2020
%E Ryan P. Adams
%E Vibhav Gogate
%F pmlr-v115-taraviya20a
%I PMLR
%P 519--529
%U https://proceedings.mlr.press/v115/taraviya20a.html
%V 115
%X Policy Iteration (PI) is a popular family of algorithms to compute an optimal policy for a given Markov Decision Problem (MDP). Starting from an arbitrary initial policy, PI repeatedly performs locally-improving switches until an optimal policy is found. The exact form of the switching rule gives rise to different variants of PI. Two decades ago, Mansour and Singh [1999] provided the first non-trivial “strong” upper bound on the number of iterations taken by “Howard’s PI” (HPI), a widely-used variant of PI (strong bounds depend only on the number of states and actions in the MDP). They also proposed a randomised variant (RPI) and showed an even tighter strong upper bound. Their bounds for HPI and RPI have not been improved subsequently. We revisit the algorithms and analysis of Mansour and Singh [1999]. We prove a novel result on the structure of the policy space for $k$-action MDPs, $k \geq 2$, which generalises a result known for $k = 2$. Also proposing a new counting argument, we obtain a strong bound of $(O(\sqrt{k \log k}))^n$ iterations for an algorithm akin to RPI, improving significantly upon Mansour and Singh’s original bound of roughly $O((k/2)^n)$. A similar analysis of a randomised variant of HPI also yields a strong upper bound of $(O(\sqrt{k \log k}))^n$ iterations, registering the first exponential improvement for HPI over the trivial bound of $k^n$. Our other contributions include a lower bound of $\Omega(n)$ iterations for RPI and an upper bound of $1.6001^n$ iterations for a randomised variant of “Batch-Switching PI” [Kalyanakrishnan et al., 2016a] on 2-action MDPs, the tightest strong upper bound shown yet for the PI family.
APA
Taraviya, M. & Kalyanakrishnan, S. (2020). A Tighter Analysis of Randomised Policy Iteration. Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, in Proceedings of Machine Learning Research 115:519-529. Available from https://proceedings.mlr.press/v115/taraviya20a.html.