GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering

Saumya Saxena, Blake Buchanan, Chris Paxton, Peiqi Liu, Bingqing Chen, Narunas Vaskevicius, Luigi Palmieri, Jonathan Francis, Oliver Kroemer
Proceedings of The 9th Conference on Robot Learning, PMLR 305:2714-2742, 2025.

Abstract

In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. This remains a challenging problem in robotics, due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient exploration and planning. Aiming to address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration. We evaluate GraphEQA in simulation on two benchmark datasets, HM-EQA and OpenEQA, and demonstrate that it outperforms key baselines by completing EQA tasks with higher success rates and fewer planning steps, and further demonstrate GraphEQA in two separate real-world environments. Videos and code are available at https://grapheqa.github.io.
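To make the core idea concrete, below is a minimal, illustrative sketch, not the authors' implementation, of how a hierarchical scene graph of rooms, objects, and unexplored frontiers might be serialized into compact text that, together with task-relevant images and the question, grounds a VLM's choice of an answer or a next exploration target. All class and method names here are hypothetical.

    # Illustrative sketch only: a hierarchical 3D scene graph serialized
    # into text so a VLM can reason over rooms, objects, and frontiers.
    from dataclasses import dataclass, field

    @dataclass
    class ObjectNode:
        name: str          # e.g. "fire_extinguisher"
        position: tuple    # (x, y, z) in the metric map

    @dataclass
    class RoomNode:
        name: str                                       # e.g. "kitchen"
        objects: list = field(default_factory=list)     # ObjectNode instances
        frontiers: list = field(default_factory=list)   # unexplored boundary points

    @dataclass
    class SceneGraph:
        rooms: list = field(default_factory=list)

        def to_prompt(self) -> str:
            """Serialize the hierarchy room-by-room as compact text for the VLM."""
            lines = []
            for room in self.rooms:
                objs = ", ".join(o.name for o in room.objects) or "none"
                lines.append(f"Room '{room.name}': objects [{objs}], "
                             f"{len(room.frontiers)} unexplored frontier(s)")
            return "\n".join(lines)

    # A planner could append this text, task-relevant images, and the question
    # to the VLM prompt, then ask it to either answer or pick a next goal
    # (a room to enter, a frontier to explore, or an object to inspect).
    graph = SceneGraph(rooms=[
        RoomNode("kitchen",
                 objects=[ObjectNode("stove", (1.0, 2.0, 0.8))],
                 frontiers=[(3.5, 2.0)]),
    ])
    print(graph.to_prompt())

The hierarchy (building, rooms, objects) keeps the prompt compact and lets the planner reason top-down: commit to a room first, then to an object or frontier within it, which is the kind of structured, semantic-guided exploration the abstract describes.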

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-saxena25a,
  title     = {GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering},
  author    = {Saxena, Saumya and Buchanan, Blake and Paxton, Chris and Liu, Peiqi and Chen, Bingqing and Vaskevicius, Narunas and Palmieri, Luigi and Francis, Jonathan and Kroemer, Oliver},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {2714--2742},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/saxena25a/saxena25a.pdf},
  url       = {https://proceedings.mlr.press/v305/saxena25a.html}
}
Endnote
%0 Conference Paper
%T GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering
%A Saumya Saxena
%A Blake Buchanan
%A Chris Paxton
%A Peiqi Liu
%A Bingqing Chen
%A Narunas Vaskevicius
%A Luigi Palmieri
%A Jonathan Francis
%A Oliver Kroemer
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-saxena25a
%I PMLR
%P 2714--2742
%U https://proceedings.mlr.press/v305/saxena25a.html
%V 305
APA
Saxena, S., Buchanan, B., Paxton, C., Liu, P., Chen, B., Vaskevicius, N., Palmieri, L., Francis, J., & Kroemer, O. (2025). GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:2714-2742. Available from https://proceedings.mlr.press/v305/saxena25a.html.