Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation

Yiguo Fan, Shuanghao Bai, Xinyang Tong, Pengxiang Ding, Yuyang Zhu, Hongchao Lu, Fengqi Dai, Wei Zhao, Yang Liu, Siteng Huang, Zhaoxin Fan, Badong Chen, Donglin Wang
Proceedings of The 9th Conference on Robot Learning, PMLR 305:2018-2037, 2025.

Abstract

Vision-Language-Action (VLA) models have become a cornerstone in robotic policy learning, leveraging large-scale multimodal data for robust and scalable control. However, existing VLA frameworks primarily address short-horizon tasks, and their effectiveness on long-horizon, multi-step robotic manipulation remains limited due to challenges in skill chaining and subtask dependencies. In this work, we introduce Long-VLA, the first end-to-end VLA model specifically designed for long-horizon robotic tasks. Our approach features a novel phase-aware input masking strategy that adaptively segments each subtask into moving and interaction phases, enabling the model to focus on phase-relevant sensory cues and enhancing subtask compatibility. This unified strategy preserves the scalability and data efficiency of VLA training, and our architecture-agnostic module can be seamlessly integrated into existing VLA models. We further propose the L-CALVIN benchmark to systematically evaluate long-horizon manipulation. Extensive experiments on both simulated and real-world tasks demonstrate that Long-VLA significantly outperforms prior state-of-the-art methods, establishing a new baseline for long-horizon robotic control.
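
To make the phase-aware input masking idea more concrete, below is a minimal, illustrative Python sketch. It is not the authors' implementation: the distance-threshold heuristic for deciding when a subtask switches from the moving phase to the interaction phase, the camera names (static_cam, wrist_cam), and the zero-out masking of token features are all assumptions made for illustration. In the paper, the segmentation is described as adaptive and the masking operates on the model's inputs, so the actual mechanism may differ.

# Illustrative sketch only: phase segmentation rule, camera names, and
# zero-masking are assumptions, not the method described in the paper.
import numpy as np

MOVING, INTERACTION = "moving", "interaction"

def infer_phase(gripper_to_target_dist: float, threshold: float = 0.05) -> str:
    """Hypothetical heuristic: treat the subtask as being in the interaction
    phase once the end-effector is within `threshold` meters of the target."""
    return INTERACTION if gripper_to_target_dist < threshold else MOVING

def phase_aware_mask(obs_tokens: dict, phase: str) -> dict:
    """Mask the sensory stream assumed to be irrelevant for the current phase,
    so the policy attends only to phase-relevant cues:
      - moving phase: keep the global/static camera, mask the wrist camera
      - interaction phase: keep the wrist camera, mask the global camera
    """
    masked = dict(obs_tokens)
    if phase == MOVING:
        masked["wrist_cam"] = np.zeros_like(obs_tokens["wrist_cam"])
    else:
        masked["static_cam"] = np.zeros_like(obs_tokens["static_cam"])
    return masked

# Example usage with dummy per-camera token features (e.g., patch embeddings).
obs = {"static_cam": np.random.randn(196, 512), "wrist_cam": np.random.randn(196, 512)}
phase = infer_phase(gripper_to_target_dist=0.12)   # -> "moving"
policy_inputs = phase_aware_mask(obs, phase)

In this sketch the masking is a simple pre-processing step on the observation streams, which is one way an architecture-agnostic module could be bolted onto an existing VLA model's inputs without changing its backbone.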

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-fan25a,
  title     = {Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation},
  author    = {Fan, Yiguo and Bai, Shuanghao and Tong, Xinyang and Ding, Pengxiang and Zhu, Yuyang and Lu, Hongchao and Dai, Fengqi and Zhao, Wei and Liu, Yang and Huang, Siteng and Fan, Zhaoxin and Chen, Badong and Wang, Donglin},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {2018--2037},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/fan25a/fan25a.pdf},
  url       = {https://proceedings.mlr.press/v305/fan25a.html},
  abstract  = {Vision-Language-Action (VLA) models have become a cornerstone in robotic policy learning, leveraging large-scale multimodal data for robust and scalable control. However, existing VLA frameworks primarily address short-horizon tasks, and their effectiveness on long-horizon, multi-step robotic manipulation remains limited due to challenges in skill chaining and subtask dependencies. In this work, we introduce Long-VLA, the first end-to-end VLA model specifically designed for long-horizon robotic tasks. Our approach features a novel phase-aware input masking strategy that adaptively segments each subtask into moving and interaction phases, enabling the model to focus on phase-relevant sensory cues and enhancing subtask compatibility. This unified strategy preserves the scalability and data efficiency of VLA training, and our architecture-agnostic module can be seamlessly integrated into existing VLA models. We further propose the L-CALVIN benchmark to systematically evaluate long-horizon manipulation. Extensive experiments on both simulated and real-world tasks demonstrate that Long-VLA significantly outperforms prior state-of-the-art methods, establishing a new baseline for long-horizon robotic control.}
}
Endnote
%0 Conference Paper
%T Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation
%A Yiguo Fan
%A Shuanghao Bai
%A Xinyang Tong
%A Pengxiang Ding
%A Yuyang Zhu
%A Hongchao Lu
%A Fengqi Dai
%A Wei Zhao
%A Yang Liu
%A Siteng Huang
%A Zhaoxin Fan
%A Badong Chen
%A Donglin Wang
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-fan25a
%I PMLR
%P 2018--2037
%U https://proceedings.mlr.press/v305/fan25a.html
%V 305
%X Vision-Language-Action (VLA) models have become a cornerstone in robotic policy learning, leveraging large-scale multimodal data for robust and scalable control. However, existing VLA frameworks primarily address short-horizon tasks, and their effectiveness on long-horizon, multi-step robotic manipulation remains limited due to challenges in skill chaining and subtask dependencies. In this work, we introduce Long-VLA, the first end-to-end VLA model specifically designed for long-horizon robotic tasks. Our approach features a novel phase-aware input masking strategy that adaptively segments each subtask into moving and interaction phases, enabling the model to focus on phase-relevant sensory cues and enhancing subtask compatibility. This unified strategy preserves the scalability and data efficiency of VLA training, and our architecture-agnostic module can be seamlessly integrated into existing VLA models. We further propose the L-CALVIN benchmark to systematically evaluate long-horizon manipulation. Extensive experiments on both simulated and real-world tasks demonstrate that Long-VLA significantly outperforms prior state-of-the-art methods, establishing a new baseline for long-horizon robotic control.
APA
Fan, Y., Bai, S., Tong, X., Ding, P., Zhu, Y., Lu, H., Dai, F., Zhao, W., Liu, Y., Huang, S., Fan, Z., Chen, B., & Wang, D. (2025). Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:2018-2037. Available from https://proceedings.mlr.press/v305/fan25a.html.