Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards

Katherine Metcalf, Miguel Sarabia, Natalie Mackraz, Barry-John Theobald
Proceedings of The 7th Conference on Robot Learning, PMLR 229:1484-1532, 2023.

Abstract

Preference-based reinforcement learning (PbRL) aligns robot behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that encoding environment dynamics in the reward function improves the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) encoding environment dynamics in a state-action representation $z^{sa}$ via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from $z^{sa}$, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover $83\%$ and $66\%$ of ground-truth-reward policy performance versus only $38\%$ and $21\%$ without environment dynamics. The performance gains demonstrate that explicitly encoding environment dynamics improves preference-learned reward functions.
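
The abstract describes alternating between a self-supervised dynamics objective and preference learning. Below is a minimal sketch of that loop in PyTorch, assuming a SimSiam-style temporal-consistency loss with a stop-gradient target and a Bradley-Terry loss over segment returns; the network sizes, the cosine-similarity objective, and the names (DynamicsAwareReward, consistency_loss, preference_loss) are illustrative assumptions, not the authors' implementation.

# Sketch only: assumes PyTorch; hyperparameters and architecture are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicsAwareReward(nn.Module):
    def __init__(self, state_dim, action_dim, z_dim=64):
        super().__init__()
        # Encoder producing the state-action representation z^{sa}.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, z_dim),
        )
        # Encodes the next state as the target of the temporal-consistency task.
        self.state_encoder = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, z_dim),
        )
        # Predicts the next-state encoding from z^{sa}.
        self.predictor = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, z_dim),
        )
        # Reward head bootstrapped from z^{sa}.
        self.reward_head = nn.Linear(z_dim, 1)

    def z_sa(self, s, a):
        return self.encoder(torch.cat([s, a], dim=-1))

    def consistency_loss(self, s, a, s_next):
        # Step (1): align the prediction from z^{sa} with the next-state
        # encoding; stop-gradient on the target, as in SimSiam.
        pred = self.predictor(self.z_sa(s, a))
        target = self.state_encoder(s_next).detach()
        return -F.cosine_similarity(pred, target, dim=-1).mean()

    def segment_return(self, states, actions):
        # Sum of predicted rewards over a behavior segment ([T, dim] tensors).
        return self.reward_head(self.z_sa(states, actions)).sum()

    def preference_loss(self, seg_a, seg_b, label):
        # Step (2): Bradley-Terry loss on a binary preference label
        # (label = 1.0 if segment A is preferred, else 0.0).
        ra = self.segment_return(*seg_a)
        rb = self.segment_return(*seg_b)
        logits = torch.stack([ra, rb])
        target = torch.tensor([label, 1.0 - label])
        return -(target * F.log_softmax(logits, dim=0)).sum()

In training, the two losses would be minimized in alternation (or jointly) with the encoder shared between them, so that the reward head is bootstrapped from a dynamics-aware $z^{sa}$ rather than learned from preference labels alone.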

Cite this Paper

BibTeX
@InProceedings{pmlr-v229-metcalf23a,
  title     = {Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards},
  author    = {Metcalf, Katherine and Sarabia, Miguel and Mackraz, Natalie and Theobald, Barry-John},
  booktitle = {Proceedings of The 7th Conference on Robot Learning},
  pages     = {1484--1532},
  year      = {2023},
  editor    = {Tan, Jie and Toussaint, Marc and Darvish, Kourosh},
  volume    = {229},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v229/metcalf23a/metcalf23a.pdf},
  url       = {https://proceedings.mlr.press/v229/metcalf23a.html},
  abstract  = {Preference-based reinforcement learning (PbRL) aligns robot behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that encoding environment dynamics in the reward function improves the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) encoding environment dynamics in a state-action representation $z^{sa}$ via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from $z^{sa}$, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover $83\%$ and $66\%$ of ground-truth-reward policy performance versus only $38\%$ and $21\%$ without environment dynamics. The performance gains demonstrate that explicitly encoding environment dynamics improves preference-learned reward functions.}
}
Endnote
%0 Conference Paper
%T Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards
%A Katherine Metcalf
%A Miguel Sarabia
%A Natalie Mackraz
%A Barry-John Theobald
%B Proceedings of The 7th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Jie Tan
%E Marc Toussaint
%E Kourosh Darvish
%F pmlr-v229-metcalf23a
%I PMLR
%P 1484--1532
%U https://proceedings.mlr.press/v229/metcalf23a.html
%V 229
%X Preference-based reinforcement learning (PbRL) aligns robot behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that encoding environment dynamics in the reward function improves the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) encoding environment dynamics in a state-action representation $z^{sa}$ via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from $z^{sa}$, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover $83\%$ and $66\%$ of ground-truth-reward policy performance versus only $38\%$ and $21\%$ without environment dynamics. The performance gains demonstrate that explicitly encoding environment dynamics improves preference-learned reward functions.
APA
Metcalf, K., Sarabia, M., Mackraz, N., & Theobald, B. (2023). Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards. Proceedings of The 7th Conference on Robot Learning, in Proceedings of Machine Learning Research 229:1484-1532. Available from https://proceedings.mlr.press/v229/metcalf23a.html.
