Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation

Teli Ma, Jiaming Zhou, Zifan Wang, Ronghe Qiu, Junwei Liang
Proceedings of The 8th Conference on Robot Learning, PMLR 270:4651-4669, 2025.

Abstract

Developing robots capable of executing various manipulation tasks, guided by natural language instructions and visual observations of intricate real-world environments, remains a significant challenge in robotics. Such robot agents need to understand linguistic commands and distinguish between the requirements of different tasks. In this work, we present Σ-agent, an end-to-end imitation learning agent for multi-task robotic manipulation. Σ-agent incorporates contrastive imitation learning (contrastive IL) modules to strengthen vision-language and current-future representations. We also introduce an effective and efficient multi-view querying Transformer (MVQ-Former) for aggregating representative semantic information. Σ-agent shows substantial improvement over state-of-the-art methods under diverse settings across 18 RLBench tasks, surpassing RVT by an average of 5.2% and 5.9% when trained with 10 and 100 demonstrations, respectively. Σ-agent also achieves a 62% success rate with a single policy across 5 real-world manipulation tasks. The code will be released upon acceptance.
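The abstract names two contrastive objectives: aligning visual features with language instructions, and aligning current observations with future ones. As a rough illustration of the kind of objective this implies, below is a minimal PyTorch sketch of a symmetric InfoNCE-style contrastive loss over paired embeddings; the function name, tensor shapes, and the tie-in to MVQ-Former outputs are assumptions for illustration, not the authors' released implementation.

import torch
import torch.nn.functional as F

def info_nce_loss(vision_emb: torch.Tensor,
                  lang_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    vision_emb: (B, D) pooled visual features (hypothetically, MVQ-Former outputs).
    lang_emb:   (B, D) pooled language-instruction features.
    Matching pairs along the batch diagonal act as positives; every other
    pairing in the batch serves as a negative.
    """
    v = F.normalize(vision_emb, dim=-1)
    t = F.normalize(lang_emb, dim=-1)
    logits = v @ t.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Average the vision-to-language and language-to-vision directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

The same loss applied to (current, future) embedding pairs would give a current-future alignment term of the sort the abstract describes.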

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-ma25a,
  title     = {Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation},
  author    = {Ma, Teli and Zhou, Jiaming and Wang, Zifan and Qiu, Ronghe and Liang, Junwei},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {4651--4669},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/ma25a/ma25a.pdf},
  url       = {https://proceedings.mlr.press/v270/ma25a.html},
  abstract  = {Developing robots capable of executing various manipulation tasks, guided by natural language instructions and visual observations of intricate real-world environments, remains a significant challenge in robotics. Such robot agents need to understand linguistic commands and distinguish between the requirements of different tasks. In this work, we present $\mathtt{\Sigma\mbox{-}agent}$, an end-to-end imitation learning agent for multi-task robotic manipulation. $\mathtt{\Sigma\mbox{-}agent}$ incorporates contrastive Imitation Learning (contrastive IL) modules to strengthen vision-language and current-future representations. An effective and efficient multi-view querying Transformer (MVQ-Former) for aggregating representative semantic information is introduced. $\mathtt{\Sigma\mbox{-}agent}$ shows substantial improvement over state-of-the-art methods under diverse settings in 18 RLBench tasks, surpassing RVT by an average of 5.2% and 5.9% in 10 and 100 demonstration training, respectively. $\mathtt{\Sigma\mbox{-}agent}$ also achieves 62% success rate with a single policy in 5 real-world manipulation tasks. The code will be released upon acceptance.}
}
Endnote
%0 Conference Paper
%T Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation
%A Teli Ma
%A Jiaming Zhou
%A Zifan Wang
%A Ronghe Qiu
%A Junwei Liang
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-ma25a
%I PMLR
%P 4651--4669
%U https://proceedings.mlr.press/v270/ma25a.html
%V 270
%X Developing robots capable of executing various manipulation tasks, guided by natural language instructions and visual observations of intricate real-world environments, remains a significant challenge in robotics. Such robot agents need to understand linguistic commands and distinguish between the requirements of different tasks. In this work, we present $\mathtt{\Sigma\mbox{-}agent}$, an end-to-end imitation learning agent for multi-task robotic manipulation. $\mathtt{\Sigma\mbox{-}agent}$ incorporates contrastive Imitation Learning (contrastive IL) modules to strengthen vision-language and current-future representations. An effective and efficient multi-view querying Transformer (MVQ-Former) for aggregating representative semantic information is introduced. $\mathtt{\Sigma\mbox{-}agent}$ shows substantial improvement over state-of-the-art methods under diverse settings in 18 RLBench tasks, surpassing RVT by an average of 5.2% and 5.9% in 10 and 100 demonstration training, respectively. $\mathtt{\Sigma\mbox{-}agent}$ also achieves 62% success rate with a single policy in 5 real-world manipulation tasks. The code will be released upon acceptance.
APA
Ma, T., Zhou, J., Wang, Z., Qiu, R. & Liang, J. (2025). Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:4651-4669. Available from https://proceedings.mlr.press/v270/ma25a.html.
