Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations

Miyu Goko, Motonari Kambara, Daichi Saito, Seitaro Otsuki, Komei Sugiura
Proceedings of The 8th Conference on Robot Learning, PMLR 270:3242-3263, 2025.

Abstract

In this study, we consider the problem of predicting task success for open-vocabulary manipulation by a manipulator, based on instruction sentences and egocentric images before and after manipulation. Conventional approaches, including multimodal large language models (MLLMs), often fail to appropriately understand detailed characteristics of objects and/or subtle changes in the position of objects. We propose Contrastive λ-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences. Our method integrates the following three key types of features into a multi-level aligned representation: features that preserve local image information; features aligned with natural language; and features structured through natural language. This allows the model to focus on important changes by looking at the differences in the representation between two images. We evaluate Contrastive λ-Repformer on a dataset based on a large-scale standard dataset, the RT-1 dataset, and on a physical robot platform. The results show that our approach outperformed existing approaches including MLLMs. Our best model achieved an improvement of 8.66 points in accuracy compared to the representative MLLM-based model.
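To make the abstract's pipeline concrete, here is a minimal Python/PyTorch sketch of the idea, not the authors' implementation. It assumes three upstream encoders have already produced fixed-size vectors for each image: a local image feature, a language-aligned feature (e.g., a CLIP-style embedding), and a language-structured feature (e.g., an embedded scene description). The sketch fuses the three into one representation per image and predicts success from the difference between the pre- and post-manipulation representations, conditioned on the instruction embedding. All module names and dimensions below are illustrative assumptions.

# Minimal sketch of the multi-level aligned representation idea from the
# abstract. All names and dimensions are illustrative assumptions, not the
# authors' Contrastive lambda-Repformer implementation.
import torch
import torch.nn as nn

class SuccessPredictorSketch(nn.Module):
    def __init__(self, dim_local=256, dim_aligned=256, dim_structured=256, d_model=512):
        super().__init__()
        d_in = dim_local + dim_aligned + dim_structured
        # Fuse the three feature types into one multi-level representation.
        self.fuse = nn.Linear(d_in, d_model)
        # Project the instruction embedding into the same space.
        self.text_proj = nn.Linear(dim_aligned, d_model)
        # Binary success/failure head over (image difference, instruction).
        self.head = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def represent(self, f_local, f_aligned, f_structured):
        # One aligned representation per image from the three feature types.
        return self.fuse(torch.cat([f_local, f_aligned, f_structured], dim=-1))

    def forward(self, feats_before, feats_after, text_feat):
        z_before = self.represent(*feats_before)
        z_after = self.represent(*feats_after)
        # The difference highlights what changed between the two images.
        delta = z_after - z_before
        z_text = self.text_proj(text_feat)
        return self.head(torch.cat([delta, z_text], dim=-1))  # success logit

# Illustrative usage with random stand-ins for the extracted features:
f = lambda: torch.randn(1, 256)
model = SuccessPredictorSketch()
logit = model((f(), f(), f()), (f(), f(), f()), torch.randn(1, 256))

Taking the difference of the fused representations, rather than comparing raw images, is what lets the model focus on task-relevant changes such as a small shift in an object's position.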

Cite this Paper

BibTeX
@InProceedings{pmlr-v270-goko25a,
  title     = {Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations},
  author    = {Goko, Miyu and Kambara, Motonari and Saito, Daichi and Otsuki, Seitaro and Sugiura, Komei},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {3242--3263},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/goko25a/goko25a.pdf},
  url       = {https://proceedings.mlr.press/v270/goko25a.html},
  abstract  = {In this study, we consider the problem of predicting task success for open-vocabulary manipulation by a manipulator, based on instruction sentences and egocentric images before and after manipulation. Conventional approaches, including multimodal large language models (MLLMs), often fail to appropriately understand detailed characteristics of objects and/or subtle changes in the position of objects. We propose Contrastive $\lambda$-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences. Our method integrates the following three key types of features into a multi-level aligned representation: features that preserve local image information; features aligned with natural language; and features structured through natural language. This allows the model to focus on important changes by looking at the differences in the representation between two images. We evaluate Contrastive $\lambda$-Repformer on a dataset based on a large-scale standard dataset, the RT-1 dataset, and on a physical robot platform. The results show that our approach outperformed existing approaches including MLLMs. Our best model achieved an improvement of 8.66 points in accuracy compared to the representative MLLM-based model.}
}
Endnote
%0 Conference Paper
%T Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations
%A Miyu Goko
%A Motonari Kambara
%A Daichi Saito
%A Seitaro Otsuki
%A Komei Sugiura
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-goko25a
%I PMLR
%P 3242--3263
%U https://proceedings.mlr.press/v270/goko25a.html
%V 270
%X In this study, we consider the problem of predicting task success for open-vocabulary manipulation by a manipulator, based on instruction sentences and egocentric images before and after manipulation. Conventional approaches, including multimodal large language models (MLLMs), often fail to appropriately understand detailed characteristics of objects and/or subtle changes in the position of objects. We propose Contrastive $\lambda$-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences. Our method integrates the following three key types of features into a multi-level aligned representation: features that preserve local image information; features aligned with natural language; and features structured through natural language. This allows the model to focus on important changes by looking at the differences in the representation between two images. We evaluate Contrastive $\lambda$-Repformer on a dataset based on a large-scale standard dataset, the RT-1 dataset, and on a physical robot platform. The results show that our approach outperformed existing approaches including MLLMs. Our best model achieved an improvement of 8.66 points in accuracy compared to the representative MLLM-based model.
APA
Goko, M., Kambara, M., Saito, D., Otsuki, S., & Sugiura, K. (2025). Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:3242-3263. Available from https://proceedings.mlr.press/v270/goko25a.html.
