DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning

Jianxiong Li, Jinliang Zheng, Yinan Zheng, Liyuan Mao, Xiao Hu, Sijie Cheng, Haoyi Niu, Jihao Liu, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Xianyuan Zhan
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:29461-29488, 2024.

Abstract

Multimodal pretraining is an effective strategy for the trinity of goals of representation learning in autonomous robots: 1) extracting both local and global task progressions; 2) enforcing temporal consistency of visual representations; 3) capturing trajectory-level language grounding. Most existing methods approach these via separate objectives, which often yield sub-optimal solutions. In this paper, we propose a universal unified objective that can simultaneously extract meaningful task progression information from image sequences and seamlessly align it with language instructions. We discover that via implicit preferences, where a visual trajectory inherently aligns better with its corresponding language instruction than with mismatched pairs, the popular Bradley-Terry model can be transformed into representation learning through proper reward reparameterizations. The resulting framework, DecisionNCE, mirrors an InfoNCE-style objective but is distinctively tailored for decision-making tasks, providing an embodied representation learning framework that elegantly extracts both local and global task progression features, with temporal consistency enforced through implicit time contrastive learning, while ensuring trajectory-level instruction grounding via multimodal joint encoding. Evaluation on both simulated and real robots demonstrates that DecisionNCE effectively facilitates diverse downstream policy learning tasks, offering a versatile solution for unified representation and reward learning. Project Page: https://2toinf.github.io/DecisionNCE/
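
To make the "InfoNCE-style objective over matched versus mismatched (trajectory, instruction) pairs" concrete, below is a minimal, hypothetical sketch of what such an objective could look like in PyTorch. It is not the authors' implementation (see the project page for the official code): the frame-difference "progression" feature, the symmetric cross-entropy form, the function and argument names, and the temperature value are all assumptions made purely for illustration.

# Minimal sketch (assumed, not the authors' code): treat each matched
# (visual trajectory segment, instruction) pair as the implicitly preferred one
# and every mismatched pair in the batch as a negative, CLIP/InfoNCE style.
import torch
import torch.nn.functional as F

def decision_nce_sketch(frame_start, frame_end, text_emb, image_encoder, temperature=0.07):
    # frame_start, frame_end: (B, C, H, W) frames sampled from each trajectory segment
    # text_emb: (B, D) embeddings of the paired language instructions
    # image_encoder: any vision backbone mapping images to (B, D) features
    z_start = image_encoder(frame_start)                  # (B, D)
    z_end = image_encoder(frame_end)                      # (B, D)
    # Assumed reparameterization: task progression as the change in frame features.
    progression = F.normalize(z_end - z_start, dim=-1)
    text = F.normalize(text_emb, dim=-1)

    logits = progression @ text.T / temperature           # (B, B) pairwise similarities
    labels = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal

    # Symmetric InfoNCE: each trajectory segment should score highest with its own
    # instruction, and each instruction with its own trajectory segment.
    loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
    return loss

Under this reading, minimizing the loss simultaneously shapes the visual features (temporal change between frames) and grounds them against the instruction embeddings, which is the unification of progression extraction and language alignment the abstract describes.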

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-li24cr,
  title     = {{D}ecision{NCE}: Embodied Multimodal Representations via Implicit Preference Learning},
  author    = {Li, Jianxiong and Zheng, Jinliang and Zheng, Yinan and Mao, Liyuan and Hu, Xiao and Cheng, Sijie and Niu, Haoyi and Liu, Jihao and Liu, Yu and Liu, Jingjing and Zhang, Ya-Qin and Zhan, Xianyuan},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {29461--29488},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/li24cr/li24cr.pdf},
  url       = {https://proceedings.mlr.press/v235/li24cr.html}
}
Endnote
%0 Conference Paper
%T DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning
%A Jianxiong Li
%A Jinliang Zheng
%A Yinan Zheng
%A Liyuan Mao
%A Xiao Hu
%A Sijie Cheng
%A Haoyi Niu
%A Jihao Liu
%A Yu Liu
%A Jingjing Liu
%A Ya-Qin Zhang
%A Xianyuan Zhan
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-li24cr
%I PMLR
%P 29461--29488
%U https://proceedings.mlr.press/v235/li24cr.html
%V 235
APA
Li, J., Zheng, J., Zheng, Y., Mao, L., Hu, X., Cheng, S., Niu, H., Liu, J., Liu, Y., Liu, J., Zhang, Y.-Q., & Zhan, X. (2024). DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:29461-29488. Available from https://proceedings.mlr.press/v235/li24cr.html.