Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation

Teli Ma, Jiaming Zhou, Zifan Wang, Ronghe Qiu, Junwei Liang
Proceedings of The 8th Conference on Robot Learning, PMLR 270:4651-4669, 2025.

Abstract

Developing robots capable of executing various manipulation tasks, guided by natural language instructions and visual observations of intricate real-world environments, remains a significant challenge in robotics. Such robot agents need to understand linguistic commands and distinguish between the requirements of different tasks. In this work, we present Σ-agent, an end-to-end imitation learning agent for multi-task robotic manipulation. Σ-agent incorporates contrastive imitation learning (contrastive IL) modules to strengthen vision-language and current-future representations. We also introduce an effective and efficient multi-view querying Transformer (MVQ-Former) for aggregating representative semantic information. Σ-agent shows substantial improvement over state-of-the-art methods under diverse settings across 18 RLBench tasks, surpassing RVT by an average of 5.2% and 5.9% when trained with 10 and 100 demonstrations, respectively. Σ-agent also achieves a 62% success rate with a single policy across 5 real-world manipulation tasks. The code will be released upon acceptance.
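The abstract names two contrastive objectives: aligning visual features with language instructions, and aligning current observations with future ones. As a rough illustration of the kind of objective this implies, below is a minimal PyTorch sketch of a symmetric InfoNCE-style contrastive loss over paired embeddings; the function name, tensor shapes, and the tie-in to MVQ-Former outputs are assumptions for illustration, not the authors' released implementation.

import torch
import torch.nn.functional as F

def info_nce_loss(vision_emb: torch.Tensor,
                  lang_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    vision_emb: (B, D) pooled visual features (hypothetically, MVQ-Former outputs).
    lang_emb:   (B, D) pooled language-instruction features.
    Matching pairs along the batch diagonal act as positives; every other
    pairing in the batch serves as a negative.
    """
    v = F.normalize(vision_emb, dim=-1)
    t = F.normalize(lang_emb, dim=-1)
    logits = v @ t.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Average the vision-to-language and language-to-vision directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

The same loss applied to (current, future) embedding pairs would give a current-future alignment term of the sort the abstract describes.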

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-ma25a,
  title     = {Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation},
  author    = {Ma, Teli and Zhou, Jiaming and Wang, Zifan and Qiu, Ronghe and Liang, Junwei},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {4651--4669},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/ma25a/ma25a.pdf},
  url       = {https://proceedings.mlr.press/v270/ma25a.html},
  abstract  = {Developing robots capable of executing various manipulation tasks, guided by natural language instructions and visual observations of intricate real-world environments, remains a significant challenge in robotics. Such robot agents need to understand linguistic commands and distinguish between the requirements of different tasks. In this work, we present $\mathtt{\Sigma\mbox{-}agent}$, an end-to-end imitation learning agent for multi-task robotic manipulation. $\mathtt{\Sigma\mbox{-}agent}$ incorporates contrastive Imitation Learning (contrastive IL) modules to strengthen vision-language and current-future representations. An effective and efficient multi-view querying Transformer (MVQ-Former) for aggregating representative semantic information is introduced. $\mathtt{\Sigma\mbox{-}agent}$ shows substantial improvement over state-of-the-art methods under diverse settings in 18 RLBench tasks, surpassing RVT by an average of 5.2% and 5.9% in 10 and 100 demonstration training, respectively. $\mathtt{\Sigma\mbox{-}agent}$ also achieves 62% success rate with a single policy in 5 real-world manipulation tasks. The code will be released upon acceptance.}
}
Endnote
%0 Conference Paper
%T Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation
%A Teli Ma
%A Jiaming Zhou
%A Zifan Wang
%A Ronghe Qiu
%A Junwei Liang
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-ma25a
%I PMLR
%P 4651--4669
%U https://proceedings.mlr.press/v270/ma25a.html
%V 270
%X Developing robots capable of executing various manipulation tasks, guided by natural language instructions and visual observations of intricate real-world environments, remains a significant challenge in robotics. Such robot agents need to understand linguistic commands and distinguish between the requirements of different tasks. In this work, we present $\mathtt{\Sigma\mbox{-}agent}$, an end-to-end imitation learning agent for multi-task robotic manipulation. $\mathtt{\Sigma\mbox{-}agent}$ incorporates contrastive Imitation Learning (contrastive IL) modules to strengthen vision-language and current-future representations. An effective and efficient multi-view querying Transformer (MVQ-Former) for aggregating representative semantic information is introduced. $\mathtt{\Sigma\mbox{-}agent}$ shows substantial improvement over state-of-the-art methods under diverse settings in 18 RLBench tasks, surpassing RVT by an average of 5.2% and 5.9% in 10 and 100 demonstration training, respectively. $\mathtt{\Sigma\mbox{-}agent}$ also achieves 62% success rate with a single policy in 5 real-world manipulation tasks. The code will be released upon acceptance.
APA
Ma, T., Zhou, J., Wang, Z., Qiu, R. & Liang, J. (2025). Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:4651-4669. Available from https://proceedings.mlr.press/v270/ma25a.html.
