End-to-End Navigation with Vision-Language Models: Transforming Spatial Reasoning into Question-Answering

Dylan Goetting, Himanshu Gaurav Singh, Antonio Loquercio
Proceedings of the International Conference on Neuro-symbolic Systems, PMLR 288:22-35, 2025.

Abstract

We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at jirl-upenn.github.io/VLMnav/.
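
To make the abstract's "one step" action selection concrete, below is a minimal sketch of the idea: the VLM itself is the policy, mapping an annotated observation directly to one of a set of candidate actions in a single query, with no separate perception/planning/control stack. All names here (query_vlm, annotate_with_actions, env) are illustrative placeholders, not the paper's actual API; the real prompt design and action space are detailed in the paper and the linked code repository.

```python
# Sketch of a zero-shot VLM-as-policy navigation step (hypothetical API).

def annotate_with_actions(image, actions):
    """Overlay numbered candidate actions (e.g., a fan of headings) on the frame.
    Placeholder: a real version would draw the arrows/labels onto the image."""
    return image  # annotation omitted in this sketch

def query_vlm(image, prompt):
    """Placeholder for a call to any instruction-tuned VLM client.
    Expected to return the model's text answer."""
    raise NotImplementedError("plug in your VLM client here")

def vlm_policy_step(env, goal):
    obs = env.get_observation()            # RGB frame from the agent's camera
    actions = env.candidate_actions()      # e.g., discrete headings plus "stop"
    annotated = annotate_with_actions(obs, actions)
    prompt = (
        f"You are navigating to: {goal}. The image shows numbered candidate "
        f"actions 0-{len(actions) - 1}. Answer with only the number of the "
        "action that makes the most progress toward the goal."
    )
    answer = query_vlm(annotated, prompt)
    try:
        choice = int(answer.strip().split()[0]) % len(actions)
    except ValueError:
        choice = 0                         # fall back to a default action
    env.step(actions[choice])              # act directly on the VLM's answer
```

Because the model is queried zero-shot, the same loop can in principle be pointed at any downstream navigation task just by changing the goal description in the prompt.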

Cite this Paper

BibTeX
@InProceedings{pmlr-v288-goetting25a,
  title     = {End-to-End Navigation with Vision-Language Models: Transforming Spatial Reasoning into Question-Answering},
  author    = {Goetting, Dylan and Singh, Himanshu Gaurav and Loquercio, Antonio},
  booktitle = {Proceedings of the International Conference on Neuro-symbolic Systems},
  pages     = {22--35},
  year      = {2025},
  editor    = {Pappas, George and Ravikumar, Pradeep and Seshia, Sanjit A.},
  volume    = {288},
  series    = {Proceedings of Machine Learning Research},
  month     = {28--30 May},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v288/main/assets/goetting25a/goetting25a.pdf},
  url       = {https://proceedings.mlr.press/v288/goetting25a.html},
  abstract  = {We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at jirl-upenn.github.io/VLMnav/.}
}
Endnote
%0 Conference Paper
%T End-to-End Navigation with Vision-Language Models: Transforming Spatial Reasoning into Question-Answering
%A Dylan Goetting
%A Himanshu Gaurav Singh
%A Antonio Loquercio
%B Proceedings of the International Conference on Neuro-symbolic Systems
%C Proceedings of Machine Learning Research
%D 2025
%E George Pappas
%E Pradeep Ravikumar
%E Sanjit A. Seshia
%F pmlr-v288-goetting25a
%I PMLR
%P 22--35
%U https://proceedings.mlr.press/v288/goetting25a.html
%V 288
%X We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at jirl-upenn.github.io/VLMnav/.
APA
Goetting, D., Singh, H.G. & Loquercio, A. (2025). End-to-End Navigation with Vision-Language Models: Transforming Spatial Reasoning into Question-Answering. Proceedings of the International Conference on Neuro-symbolic Systems, in Proceedings of Machine Learning Research 288:22-35. Available from https://proceedings.mlr.press/v288/goetting25a.html.
