GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation

Hang Yin, Haoyu Wei, Xiuwei Xu, Wenxuan Guo, Jie Zhou, Jiwen Lu
Proceedings of The 9th Conference on Robot Learning, PMLR 305:1809-1824, 2025.

Abstract

In this paper, we propose a training-free framework for vision-and-language navigation (VLN). Existing zero-shot VLN methods are mainly designed for discrete environments or involve unsupervised training in continuous simulator environments, which makes it challenging to generalize and deploy them in real-world scenarios. To achieve a training-free framework in continuous environments, our framework formulates navigation guidance as graph constraint optimization by decomposing instructions into explicit spatial constraints. The constraint-driven paradigm decodes spatial semantics through constraint solving, enabling zero-shot adaptation to unseen environments. Specifically, we construct a spatial constraint library covering all types of spatial relationships mentioned in VLN instructions. The human instruction is decomposed into a directed acyclic graph consisting of waypoint nodes, object nodes, and edges, which serve as queries for retrieving from the library to build the graph constraints. A constraint solver then solves the graph constraint optimization to determine the waypoint positions, yielding the robot's navigation path and final goal. To handle cases with no solution or multiple solutions, we construct a navigation tree with a backtracking mechanism. Extensive experiments on standard benchmarks demonstrate significant improvements in success rate and navigation efficiency compared to state-of-the-art zero-shot VLN methods. We further conduct real-world experiments to show our framework can effectively generalize to new environments and instruction sets, paving the way for more robust and autonomous navigation frameworks.
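The sketch below illustrates the kind of pipeline the abstract describes: an instruction decomposed into a small graph of waypoint and object nodes, spatial relations looked up in a constraint library, and waypoint positions obtained by constraint solving. It is a minimal illustration only, not the authors' implementation; the relation names, penalty functions, object positions, and the use of scipy's general-purpose optimizer are all assumptions made for the example.

```python
# Illustrative sketch of instruction-as-graph-constraints, NOT the paper's code.
# The constraint "library", relations, and solver choice are assumptions.
import numpy as np
from scipy.optimize import minimize

# Hypothetical spatial-constraint library: each relation maps to a penalty that
# is zero when the constraint between a waypoint and an object position holds.
CONSTRAINT_LIBRARY = {
    "near":    lambda wp, obj: max(0.0, np.linalg.norm(wp - obj) - 1.0) ** 2,
    "left_of": lambda wp, obj: max(0.0, wp[0] - obj[0]) ** 2,
    "past":    lambda wp, obj: max(0.0, obj[1] - wp[1]) ** 2,
}

# Toy decomposition of "walk past the table, then stop near the chair, on its
# left": two waypoint nodes, two object nodes, edges labelled with relations.
objects = {"table": np.array([1.0, 2.0]), "chair": np.array([3.0, 4.0])}
edges = [              # (waypoint index, relation, object name)
    (0, "past", "table"),
    (1, "near", "chair"),
    (1, "left_of", "chair"),
]
num_waypoints = 2

def total_penalty(flat):
    """Sum of constraint penalties over all graph edges for candidate waypoints."""
    wps = flat.reshape(num_waypoints, 2)
    return sum(CONSTRAINT_LIBRARY[rel](wps[i], objects[name])
               for i, rel, name in edges)

# Solve the graph-constraint optimization for the 2-D waypoint positions.
x0 = np.zeros(num_waypoints * 2)
result = minimize(total_penalty, x0, method="Nelder-Mead")
waypoints = result.x.reshape(num_waypoints, 2)
print("solved waypoints:\n", waypoints)
print("residual penalty:", result.fun)  # near zero when all constraints are met
```

A residual penalty well above zero would correspond to the "no solution" case the abstract mentions, which the paper handles with a navigation tree and backtracking rather than a single solve.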

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-yin25a,
  title     = {GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation},
  author    = {Yin, Hang and Wei, Haoyu and Xu, Xiuwei and Guo, Wenxuan and Zhou, Jie and Lu, Jiwen},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {1809--1824},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/yin25a/yin25a.pdf},
  url       = {https://proceedings.mlr.press/v305/yin25a.html}
}
Endnote
%0 Conference Paper
%T GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation
%A Hang Yin
%A Haoyu Wei
%A Xiuwei Xu
%A Wenxuan Guo
%A Jie Zhou
%A Jiwen Lu
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-yin25a
%I PMLR
%P 1809--1824
%U https://proceedings.mlr.press/v305/yin25a.html
%V 305
APA
Yin, H., Wei, H., Xu, X., Guo, W., Zhou, J., & Lu, J. (2025). GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:1809-1824. Available from https://proceedings.mlr.press/v305/yin25a.html.