SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation

Michael Joseph Munje, Chen Tang, Shuijing Liu, Zichao Hu, Yifeng Zhu, Jiaxun Cui, Garrett Warnell, Joydeep Biswas, Peter Stone
Proceedings of The 9th Conference on Robot Learning, PMLR 305:1120-1143, 2025.

Abstract

Robot navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding, including spatiotemporal awareness and the ability to interpret human intentions. Recent Vision-Language Models (VLMs) exhibit promising capabilities, such as object recognition, common-sense reasoning, and contextual understanding, that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can reliably perform the complex spatiotemporal reasoning and intent inference needed for safe and socially compliant robot navigation. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus, indicating critical gaps in the social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs. We will open source the code and release the benchmark.
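
The abstract reports an agreement-with-humans score for each model. The paper's exact metric definition is not given here, but the following minimal sketch illustrates one common way such a "probability of agreeing with human answers" can be computed for multiple-choice VQA: for each question, score the model by the fraction of human annotators who chose the same answer, then average over questions. All names (VQAItem, probability_of_human_agreement, the toy data) are hypothetical and not taken from the released benchmark code.

# Minimal sketch (not the official SocialNav-SUB evaluation code) of an
# agreement-with-human-answers metric for multiple-choice VQA.
from dataclasses import dataclass
from typing import List


@dataclass
class VQAItem:
    question_id: str
    model_answer: str          # the VLM's selected multiple-choice option
    human_answers: List[str]   # answers given by several human annotators


def probability_of_human_agreement(items: List[VQAItem]) -> float:
    """Average, over questions, of the fraction of annotators the model agrees with."""
    per_question = []
    for item in items:
        if not item.human_answers:
            continue
        matches = sum(a == item.model_answer for a in item.human_answers)
        per_question.append(matches / len(item.human_answers))
    return sum(per_question) / len(per_question) if per_question else 0.0


# Example with toy data: the model matches 2 of 3 annotators on q1 and
# none on q2, so the score is (2/3 + 0/3) / 2 = 0.333...
items = [
    VQAItem("q1", "B", ["B", "B", "A"]),
    VQAItem("q2", "C", ["A", "A", "A"]),
]
print(probability_of_human_agreement(items))

Under this interpretation, human consensus and a rule-based baseline can be scored with the same function by substituting their answers for model_answer, which makes the comparisons in the abstract directly comparable.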

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-munje25a,
  title     = {SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation},
  author    = {Munje, Michael Joseph and Tang, Chen and Liu, Shuijing and Hu, Zichao and Zhu, Yifeng and Cui, Jiaxun and Warnell, Garrett and Biswas, Joydeep and Stone, Peter},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {1120--1143},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/munje25a/munje25a.pdf},
  url       = {https://proceedings.mlr.press/v305/munje25a.html},
  abstract  = {Robot navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding, including spatiotemporal awareness and the ability to interpret human intentions. Recent Vision-Language Models (VLMs) exhibit promising capabilities, such as object recognition, common-sense reasoning, and contextual understanding, that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can reliably perform the complex spatiotemporal reasoning and intent inference needed for safe and socially compliant robot navigation. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus, indicating critical gaps in the social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs. We will open source the code and release the benchmark.}
}
Endnote
%0 Conference Paper
%T SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation
%A Michael Joseph Munje
%A Chen Tang
%A Shuijing Liu
%A Zichao Hu
%A Yifeng Zhu
%A Jiaxun Cui
%A Garrett Warnell
%A Joydeep Biswas
%A Peter Stone
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-munje25a
%I PMLR
%P 1120--1143
%U https://proceedings.mlr.press/v305/munje25a.html
%V 305
%X Robot navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding, including spatiotemporal awareness and the ability to interpret human intentions. Recent Vision-Language Models (VLMs) exhibit promising capabilities, such as object recognition, common-sense reasoning, and contextual understanding, that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can reliably perform the complex spatiotemporal reasoning and intent inference needed for safe and socially compliant robot navigation. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus, indicating critical gaps in the social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs. We will open source the code and release the benchmark.
APA
Munje, M.J., Tang, C., Liu, S., Hu, Z., Zhu, Y., Cui, J., Warnell, G., Biswas, J. & Stone, P. (2025). SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:1120-1143. Available from https://proceedings.mlr.press/v305/munje25a.html.
