End-to-End Navigation with Vision-Language Models: Transforming Spatial Reasoning into Question-Answering

Dylan Goetting, Himanshu Gaurav Singh, Antonio Loquercio
Proceedings of the International Conference on Neuro-symbolic Systems, PMLR 288:22-35, 2025.

Abstract

We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at jirl-upenn.github.io/VLMnav/.
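
To make the abstract's "one step" action selection concrete, below is a minimal sketch of the idea: the VLM itself is the policy, mapping an annotated observation directly to one of a set of candidate actions in a single query, with no separate perception/planning/control stack. All names here (query_vlm, annotate_with_actions, env) are illustrative placeholders, not the paper's actual API; the real prompt design and action space are detailed in the paper and the linked code repository.

```python
# Sketch of a zero-shot VLM-as-policy navigation step (hypothetical API).

def annotate_with_actions(image, actions):
    """Overlay numbered candidate actions (e.g., a fan of headings) on the frame.
    Placeholder: a real version would draw the arrows/labels onto the image."""
    return image  # annotation omitted in this sketch

def query_vlm(image, prompt):
    """Placeholder for a call to any instruction-tuned VLM client.
    Expected to return the model's text answer."""
    raise NotImplementedError("plug in your VLM client here")

def vlm_policy_step(env, goal):
    obs = env.get_observation()            # RGB frame from the agent's camera
    actions = env.candidate_actions()      # e.g., discrete headings plus "stop"
    annotated = annotate_with_actions(obs, actions)
    prompt = (
        f"You are navigating to: {goal}. The image shows numbered candidate "
        f"actions 0-{len(actions) - 1}. Answer with only the number of the "
        "action that makes the most progress toward the goal."
    )
    answer = query_vlm(annotated, prompt)
    try:
        choice = int(answer.strip().split()[0]) % len(actions)
    except ValueError:
        choice = 0                         # fall back to a default action
    env.step(actions[choice])              # act directly on the VLM's answer
```

Because the model is queried zero-shot, the same loop can in principle be pointed at any downstream navigation task just by changing the goal description in the prompt.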

Cite this Paper

BibTeX
@InProceedings{pmlr-v288-goetting25a,
  title     = {End-to-End Navigation with Vision-Language Models: Transforming Spatial Reasoning into Question-Answering},
  author    = {Goetting, Dylan and Singh, Himanshu Gaurav and Loquercio, Antonio},
  booktitle = {Proceedings of the International Conference on Neuro-symbolic Systems},
  pages     = {22--35},
  year      = {2025},
  editor    = {Pappas, George and Ravikumar, Pradeep and Seshia, Sanjit A.},
  volume    = {288},
  series    = {Proceedings of Machine Learning Research},
  month     = {28--30 May},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v288/main/assets/goetting25a/goetting25a.pdf},
  url       = {https://proceedings.mlr.press/v288/goetting25a.html},
  abstract  = {We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at jirl-upenn.github.io/VLMnav/.}
}
Endnote
%0 Conference Paper
%T End-to-End Navigation with Vision-Language Models: Transforming Spatial Reasoning into Question-Answering
%A Dylan Goetting
%A Himanshu Gaurav Singh
%A Antonio Loquercio
%B Proceedings of the International Conference on Neuro-symbolic Systems
%C Proceedings of Machine Learning Research
%D 2025
%E George Pappas
%E Pradeep Ravikumar
%E Sanjit A. Seshia
%F pmlr-v288-goetting25a
%I PMLR
%P 22--35
%U https://proceedings.mlr.press/v288/goetting25a.html
%V 288
%X We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at jirl-upenn.github.io/VLMnav/.
APA
Goetting, D., Singh, H.G. & Loquercio, A. (2025). End-to-End Navigation with Vision-Language Models: Transforming Spatial Reasoning into Question-Answering. Proceedings of the International Conference on Neuro-symbolic Systems, in Proceedings of Machine Learning Research 288:22-35. Available from https://proceedings.mlr.press/v288/goetting25a.html.
