Learning Reward for Robot Skills Using Large Language Models via Self-Alignment

Yuwei Zeng, Yao Mu, Lin Shao
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:58366-58386, 2024.

Abstract

Learning reward functions remains the bottleneck to equipping a robot with a broad repertoire of skills. Large Language Models (LLMs) contain valuable task-related knowledge that can potentially aid reward learning. However, the reward functions they propose can be imprecise, and thus ineffective, and need to be further grounded with environment information. We propose a method to learn rewards more efficiently in the absence of humans. Our approach consists of two components: we first use the LLM to propose features and a parameterization of the reward, then update the parameters through an iterative self-alignment process. In particular, the process minimizes the ranking inconsistency between the LLM and the learned reward functions based on execution feedback. The method was validated on 9 tasks across 2 simulation environments. It demonstrates a consistent improvement in training efficacy and efficiency, while consuming significantly fewer GPT tokens than the alternative mutation-based method.
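To make the self-alignment loop described above concrete, here is a minimal sketch under assumptions of our own: the reward is linear in LLM-proposed features, each executed rollout is summarized as a feature vector, and the LLM ranking of rollouts is stubbed out as a placeholder. All names (llm_rank_rollouts, self_align, etc.) are hypothetical illustrations, not the authors' implementation or API.

```python
# Hedged sketch: iteratively update reward parameters so the learned reward's
# ranking of rollouts agrees with an LLM-provided ranking (pairwise logistic
# ranking loss). This is an illustrative toy, not the paper's code.
import numpy as np

def llm_rank_rollouts(rollouts):
    """Placeholder for querying the LLM to rank executed rollouts (best first).

    In the real pipeline this would be an LLM call conditioned on execution
    feedback; here it ranks by a hidden toy 'success' score.
    """
    return sorted(range(len(rollouts)), key=lambda i: -rollouts[i]["success"])

def reward(theta, phi):
    """Parameterized reward: linear combination of LLM-proposed features."""
    return float(theta @ phi)

def ranking_inconsistency_grad(theta, rollouts, llm_order):
    """Gradient of a pairwise logistic loss penalizing disagreement with the LLM ranking."""
    grad = np.zeros_like(theta)
    for a in range(len(llm_order)):
        for b in range(a + 1, len(llm_order)):
            hi, lo = llm_order[a], llm_order[b]        # LLM prefers hi over lo
            diff = rollouts[hi]["phi"] - rollouts[lo]["phi"]
            margin = theta @ diff                       # reward gap under theta
            grad -= diff * (1.0 / (1.0 + np.exp(margin)))
    return grad

def self_align(rollouts, n_features, iters=50, lr=0.1):
    """Iteratively update reward parameters to reduce ranking inconsistency."""
    theta = np.zeros(n_features)
    for _ in range(iters):
        llm_order = llm_rank_rollouts(rollouts)
        theta -= lr * ranking_inconsistency_grad(theta, rollouts, llm_order)
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy rollouts: 3 LLM-proposed features plus a hidden success score.
    rollouts = [{"phi": rng.normal(size=3), "success": float(s)}
                for s in rng.normal(size=8)]
    print("learned reward weights:", self_align(rollouts, n_features=3))
```

In the full method the rollouts would come from executing a policy trained on the current reward, closing the loop between reward updates and new execution feedback.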

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-zeng24d,
  title     = {Learning Reward for Robot Skills Using Large Language Models via Self-Alignment},
  author    = {Zeng, Yuwei and Mu, Yao and Shao, Lin},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {58366--58386},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/zeng24d/zeng24d.pdf},
  url       = {https://proceedings.mlr.press/v235/zeng24d.html},
  abstract  = {Learning reward functions remains the bottleneck to equip a robot with a broad repertoire of skills. Large Language Models (LLM) contain valuable task-related knowledge that can potentially aid in the learning of reward functions. However, the proposed reward function can be imprecise, thus ineffective which requires to be further grounded with environment information. We proposed a method to learn rewards more efficiently in the absence of humans. Our approach consists of two components: We first use the LLM to propose features and parameterization of the reward, then update the parameters through an iterative self-alignment process. In particular, the process minimizes the ranking inconsistency between the LLM and the learnt reward functions based on the execution feedback. The method was validated on 9 tasks across 2 simulation environments. It demonstrates a consistent improvement in training efficacy and efficiency, meanwhile consuming significantly fewer GPT tokens compared to the alternative mutation-based method.}
}
Endnote
%0 Conference Paper
%T Learning Reward for Robot Skills Using Large Language Models via Self-Alignment
%A Yuwei Zeng
%A Yao Mu
%A Lin Shao
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-zeng24d
%I PMLR
%P 58366--58386
%U https://proceedings.mlr.press/v235/zeng24d.html
%V 235
%X Learning reward functions remains the bottleneck to equip a robot with a broad repertoire of skills. Large Language Models (LLM) contain valuable task-related knowledge that can potentially aid in the learning of reward functions. However, the proposed reward function can be imprecise, thus ineffective which requires to be further grounded with environment information. We proposed a method to learn rewards more efficiently in the absence of humans. Our approach consists of two components: We first use the LLM to propose features and parameterization of the reward, then update the parameters through an iterative self-alignment process. In particular, the process minimizes the ranking inconsistency between the LLM and the learnt reward functions based on the execution feedback. The method was validated on 9 tasks across 2 simulation environments. It demonstrates a consistent improvement in training efficacy and efficiency, meanwhile consuming significantly fewer GPT tokens compared to the alternative mutation-based method.
APA
Zeng, Y., Mu, Y., & Shao, L. (2024). Learning Reward for Robot Skills Using Large Language Models via Self-Alignment. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:58366-58386. Available from https://proceedings.mlr.press/v235/zeng24d.html.
