MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention

Yuxin Chen, Chen Tang, Jianglan Wei, Chenran Li, Thomas Tian, Xiang Zhang, Wei Zhan, Peter Stone, Masayoshi Tomizuka
Proceedings of The 9th Conference on Robot Learning, PMLR 305:2121-2142, 2025.

Abstract

Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy’s execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thus hindering sample efficiency. In this work, we introduce Maximum-Entropy Residual-Q Inverse Reinforcement Learning (MEReQ), designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert’s and the prior policy’s underlying reward functions. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function. Extensive evaluations on simulated and real-world tasks demonstrate that MEReQ achieves sample-efficient policy alignment from human intervention compared to other baselines.
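The two steps compressed into the abstract can be read in standard max-entropy notation. The sketch below is an illustrative reconstruction, not the paper's exact formulation: the temperature $\alpha$, the rewards $r_{\mathrm{prior}}$ and $r_{\mathrm{res}}$, and the residual soft value $Q_{\mathrm{res}}$ are assumed symbols, and a single shared temperature is assumed for the prior and aligned policies. The human expert is modeled as soft-optimal for an unknown reward that decomposes into the prior policy's reward plus a residual term,

$$
\pi_{E}(a \mid s) \;\propto\; \exp\!\Big(\tfrac{1}{\alpha}\, Q_{E}(s,a)\Big),
\qquad
r_{E}(s,a) \;=\; r_{\mathrm{prior}}(s,a) + r_{\mathrm{res}}(s,a),
$$

so only $r_{\mathrm{res}}$ needs to be inferred from intervention data (e.g., by a MaxEnt IRL objective). Residual Q-Learning then recovers the aligned policy as a correction to the prior policy,

$$
\pi(a \mid s) \;\propto\; \pi_{\mathrm{prior}}(a \mid s)\,\exp\!\Big(\tfrac{1}{\alpha}\, Q_{\mathrm{res}}(s,a)\Big),
$$

where $Q_{\mathrm{res}}$ is trained from $r_{\mathrm{res}}$ and $\log \pi_{\mathrm{prior}}$ without access to $r_{\mathrm{prior}}$. Under this reading, inferring only the discrepancy rather than the full expert reward is what the abstract credits for the sample-efficiency gain.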

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-chen25c,
  title     = {MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention},
  author    = {Chen, Yuxin and Tang, Chen and Wei, Jianglan and Li, Chenran and Tian, Thomas and Zhang, Xiang and Zhan, Wei and Stone, Peter and Tomizuka, Masayoshi},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {2121--2142},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/chen25c/chen25c.pdf},
  url       = {https://proceedings.mlr.press/v305/chen25c.html},
  abstract  = {Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy’s execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thus hindering sample efficiency. In this work, we introduce Maximum-Entropy Residual-Q Inverse Reinforcement Learning, designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert’s and the prior policy’s underlying reward functions. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function. Extensive evaluations on simulated and real-world tasks demonstrate that MEReQ achieves sample-efficient policy alignment from human intervention compared to other baselines.}
}
Endnote
%0 Conference Paper
%T MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention
%A Yuxin Chen
%A Chen Tang
%A Jianglan Wei
%A Chenran Li
%A Thomas Tian
%A Xiang Zhang
%A Wei Zhan
%A Peter Stone
%A Masayoshi Tomizuka
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-chen25c
%I PMLR
%P 2121--2142
%U https://proceedings.mlr.press/v305/chen25c.html
%V 305
%X Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy’s execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thus hindering sample efficiency. In this work, we introduce Maximum-Entropy Residual-Q Inverse Reinforcement Learning, designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert’s and the prior policy’s underlying reward functions. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function. Extensive evaluations on simulated and real-world tasks demonstrate that MEReQ achieves sample-efficient policy alignment from human intervention compared to other baselines.
APA
Chen, Y., Tang, C., Wei, J., Li, C., Tian, T., Zhang, X., Zhan, W., Stone, P., & Tomizuka, M. (2025). MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:2121-2142. Available from https://proceedings.mlr.press/v305/chen25c.html.