Generalised Policy Improvement with Geometric Policy Composition

Shantanu Thakoor, Mark Rowland, Diana Borsa, Will Dabney, Remi Munos, Andre Barreto
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:21272-21307, 2022.

Abstract

We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a γ-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs, without any additional learning. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on a challenging deep RL continuous control task. We also provide an analysis of GHM training methods, proving a novel convergence result regarding previously proposed methods and showing how to train these models stably in deep RL settings.
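
To make the recipe in the abstract concrete, the following minimal Python sketch (not the authors' implementation) illustrates the two ingredients it combines: estimating a policy's action values from samples of its geometric horizon model, and applying generalised policy improvement over a collection of such estimates. The samplers ghm_sample, ghm_1, ghm_2, the reward function, the policy, and the action set are hypothetical placeholders, and the composition shown is a simplification of the paper's exact construction.

import numpy as np

GAMMA = 0.99       # discount factor
N_SAMPLES = 64     # Monte Carlo samples per value estimate


def ghm_q_estimate(ghm_sample, reward_fn, state, action, rng):
    # A GHM (gamma-model) of policy pi lets us draw future states
    # x ~ mu_gamma^pi(. | state, action), the discounted state-visitation
    # distribution, so for state-dependent rewards
    # Q^pi(s, a) ~= E[r(x)] / (1 - gamma).
    rewards = [reward_fn(ghm_sample(state, action, rng)) for _ in range(N_SAMPLES)]
    return float(np.mean(rewards)) / (1.0 - GAMMA)


def composed_ghm_sample(ghm_1, ghm_2, policy_2, state, action, rng):
    # Chain two base-policy GHMs to imitate a policy that follows pi_1 and then
    # switches to pi_2: sample where pi_1 leads from (state, action), then sample
    # again from pi_2's GHM starting there. The paper's composition additionally
    # accounts for the geometric switching time and discounting; this only
    # conveys the structural idea of chaining base-policy GHM samples.
    mid_state = ghm_1(state, action, rng)
    return ghm_2(mid_state, policy_2(mid_state, rng), rng)


def gpi_action(state, actions, q_estimators, rng):
    # Generalised policy improvement: act greedily with respect to the maximum,
    # over all (base or composed) policies, of their estimated action values.
    return max(actions, key=lambda a: max(q(state, a, rng) for q in q_estimators))

In the paper, the GPI step ranges over switching policies built by composing the GHMs of a small set of base policies, so the improvement step benefits from a longer, planning-like lookahead without learning any additional models.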

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-thakoor22a,
  title     = {Generalised Policy Improvement with Geometric Policy Composition},
  author    = {Thakoor, Shantanu and Rowland, Mark and Borsa, Diana and Dabney, Will and Munos, Remi and Barreto, Andre},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {21272--21307},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/thakoor22a/thakoor22a.pdf},
  url       = {https://proceedings.mlr.press/v162/thakoor22a.html}
}
Endnote
%0 Conference Paper
%T Generalised Policy Improvement with Geometric Policy Composition
%A Shantanu Thakoor
%A Mark Rowland
%A Diana Borsa
%A Will Dabney
%A Remi Munos
%A Andre Barreto
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-thakoor22a
%I PMLR
%P 21272--21307
%U https://proceedings.mlr.press/v162/thakoor22a.html
%V 162
APA
Thakoor, S., Rowland, M., Borsa, D., Dabney, W., Munos, R. & Barreto, A. (2022). Generalised Policy Improvement with Geometric Policy Composition. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:21272-21307. Available from https://proceedings.mlr.press/v162/thakoor22a.html.
