Model-free Policy Learning with Reward Gradients

Qingfeng Lan, Samuele Tosatto, Homayoon Farrahi, Rupam Mahmood
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:4217-4234, 2022.

Abstract

Despite the increasing popularity of policy gradient methods, they have yet to be widely adopted in sample-scarce applications such as robotics. Sample efficiency could be improved by making the best use of the available information. As a key component of reinforcement learning, the reward function is usually carefully devised to guide the agent and is therefore typically known, giving access not only to scalar reward signals but also to reward gradients. To benefit from reward gradients, previous works require knowledge of the environment dynamics, which is hard to obtain. In this work, we develop the Reward Policy Gradient estimator, a novel approach that integrates reward gradients without learning a model. Bypassing the model dynamics allows our estimator to achieve a better bias-variance trade-off and thus higher sample efficiency, as shown in our empirical analysis. Our method also boosts the performance of Proximal Policy Optimization on several MuJoCo control tasks.
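The abstract's central point is that a known, differentiable reward function provides a gradient signal that can be exploited without any model of the environment dynamics. The sketch below illustrates that principle on a one-step problem, comparing a likelihood-ratio (score-function) estimator against a pathwise estimator that differentiates through the known reward. It is an illustration of the underlying idea only, not the paper's Reward Policy Gradient estimator; the quadratic reward, Gaussian policy, and sample size are assumptions chosen purely for the example.

# A minimal numerical sketch (not the paper's estimator) of the idea in the
# abstract: when the reward is known and differentiable, its gradient with
# respect to the action gives a pathwise (reparameterized) gradient term that
# needs no model of the environment dynamics.
import numpy as np

rng = np.random.default_rng(0)

# Known, differentiable reward: r(a) = -(a - 2)^2, so dr/da = -2 (a - 2).
def reward(a):
    return -(a - 2.0) ** 2

def reward_grad(a):
    return -2.0 * (a - 2.0)

# Gaussian policy a ~ N(mu, sigma^2); we estimate d E[r(a)] / d mu.
mu, sigma, n = 0.0, 1.0, 10_000
eps = rng.standard_normal(n)
a = mu + sigma * eps          # reparameterized action: a = mu + sigma * eps

# Likelihood-ratio (score-function) estimator: grad_mu log pi(a) * r(a).
score = (a - mu) / sigma ** 2
lr_samples = score * reward(a)

# Pathwise estimator using the known reward gradient: dr/da * da/dmu (= 1).
pathwise_samples = reward_grad(a) * 1.0

true_grad = -2.0 * (mu - 2.0)  # analytic gradient of E[r] w.r.t. mu
print(f"true gradient            : {true_grad:.3f}")
print(f"likelihood-ratio estimate: {lr_samples.mean():.3f}  (std {lr_samples.std():.2f})")
print(f"pathwise estimate        : {pathwise_samples.mean():.3f}  (std {pathwise_samples.std():.2f})")

Both estimators converge to the same gradient, but the pathwise term, the piece that a known reward gradient makes available, typically exhibits much lower variance in this toy setting. This is the kind of bias-variance benefit the abstract refers to, here shown without any learned or assumed dynamics model.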

Cite this Paper


BibTeX
@InProceedings{pmlr-v151-lan22a,
  title     = {Model-free Policy Learning with Reward Gradients},
  author    = {Lan, Qingfeng and Tosatto, Samuele and Farrahi, Homayoon and Mahmood, Rupam},
  booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
  pages     = {4217--4234},
  year      = {2022},
  editor    = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
  volume    = {151},
  series    = {Proceedings of Machine Learning Research},
  month     = {28--30 Mar},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v151/lan22a/lan22a.pdf},
  url       = {https://proceedings.mlr.press/v151/lan22a.html}
}
APA
Lan, Q., Tosatto, S., Farrahi, H. & Mahmood, R. (2022). Model-free Policy Learning with Reward Gradients. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 151:4217-4234. Available from https://proceedings.mlr.press/v151/lan22a.html.
