Learning to Score Behaviors for Guided Policy Optimization

Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Krzysztof Choromanski, Anna Choromanska, Michael Jordan
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:7445-7454, 2020.

Abstract

We introduce a new approach for comparing reinforcement learning policies, using Wasserstein distances (WDs) in a newly defined latent behavioral space. We show that by utilizing the dual formulation of the WD, we can learn score functions over policy behaviors that can in turn be used to steer policy optimization towards desired behaviors or away from undesired ones. Combined with smoothed WDs, the dual formulation allows us to devise efficient algorithms that take stochastic gradient descent steps through WD regularizers. We incorporate these regularizers into two novel on-policy algorithms, Behavior-Guided Policy Gradient and Behavior-Guided Evolution Strategies, which we demonstrate can outperform existing methods in a variety of challenging environments. We also provide an open-source demo.
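The abstract's key mechanism, learning score functions via the dual of a smoothed WD, can be made concrete with the standard entropy-regularized Kantorovich dual. The following is an illustrative sketch in our own notation (the paper's exact smoothing scheme and cost may differ): gamma > 0 is the smoothing strength, C is a ground cost on the latent behavioral space, and the potentials lambda_mu, lambda_nu play the role of the learned behavior score functions.

\[
\mathrm{WD}_{\gamma}(\mu,\nu)
\;=\;
\max_{\lambda_{\mu},\,\lambda_{\nu}}
\;\mathbb{E}_{x\sim\mu}\!\left[\lambda_{\mu}(x)\right]
+ \mathbb{E}_{y\sim\nu}\!\left[\lambda_{\nu}(y)\right]
- \gamma\,\mathbb{E}_{(x,y)\sim\mu\otimes\nu}\!\left[\exp\!\left(\frac{\lambda_{\mu}(x)+\lambda_{\nu}(y)-C(x,y)}{\gamma}\right)\right]
\]

Because this smoothed dual is an unconstrained maximization of expectations, both the potentials and (through the behavior embeddings) the policy parameters admit sample-based gradient estimates, which is what makes the stochastic gradient descent steps through WD regularizers mentioned in the abstract feasible.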

Cite this Paper

BibTeX

@InProceedings{pmlr-v119-pacchiano20a,
  title     = {Learning to Score Behaviors for Guided Policy Optimization},
  author    = {Pacchiano, Aldo and Parker-Holder, Jack and Tang, Yunhao and Choromanski, Krzysztof and Choromanska, Anna and Jordan, Michael},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {7445--7454},
  year      = {2020},
  editor    = {III, Hal Daumé and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/pacchiano20a/pacchiano20a.pdf},
  url       = {https://proceedings.mlr.press/v119/pacchiano20a.html},
  abstract  = {We introduce a new approach for comparing reinforcement learning policies, using Wasserstein distances (WDs) in a newly defined latent behavioral space. We show that by utilizing the dual formulation of the WD, we can learn score functions over policy behaviors that can in turn be used to lead policy optimization towards (or away from) (un)desired behaviors. Combined with smoothed WDs, the dual formulation allows us to devise efficient algorithms that take stochastic gradient descent steps through WD regularizers. We incorporate these regularizers into two novel on-policy algorithms, Behavior-Guided Policy Gradient and Behavior-Guided Evolution Strategies, which we demonstrate can outperform existing methods in a variety of challenging environments. We also provide an open source demo.}
}
Endnote

%0 Conference Paper
%T Learning to Score Behaviors for Guided Policy Optimization
%A Aldo Pacchiano
%A Jack Parker-Holder
%A Yunhao Tang
%A Krzysztof Choromanski
%A Anna Choromanska
%A Michael Jordan
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-pacchiano20a
%I PMLR
%P 7445--7454
%U https://proceedings.mlr.press/v119/pacchiano20a.html
%V 119
%X We introduce a new approach for comparing reinforcement learning policies, using Wasserstein distances (WDs) in a newly defined latent behavioral space. We show that by utilizing the dual formulation of the WD, we can learn score functions over policy behaviors that can in turn be used to lead policy optimization towards (or away from) (un)desired behaviors. Combined with smoothed WDs, the dual formulation allows us to devise efficient algorithms that take stochastic gradient descent steps through WD regularizers. We incorporate these regularizers into two novel on-policy algorithms, Behavior-Guided Policy Gradient and Behavior-Guided Evolution Strategies, which we demonstrate can outperform existing methods in a variety of challenging environments. We also provide an open source demo.
APA
Pacchiano, A., Parker-Holder, J., Tang, Y., Choromanski, K., Choromanska, A. & Jordan, M. (2020). Learning to Score Behaviors for Guided Policy Optimization. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:7445-7454. Available from https://proceedings.mlr.press/v119/pacchiano20a.html.