Pointing3D: A Benchmark for 3D Object Referral via Pointing Gestures

Mert Arslanoglu, Kadir Yilmaz, Cemhan Kaan Özaltan, Timm Linder, Bastian Leibe
Proceedings of The 9th Conference on Robot Learning, PMLR 305:2170-2183, 2025.

Abstract

Pointing gestures provide a natural and efficient way to communicate spatial information in human-machine interaction, yet their potential for 3D object referral remains largely under-explored. To fill this gap, we introduce the task of pointing-based 3D segmentation. In this task, given an image of a person pointing at an object and the 3D point cloud of the environment, the goal is to predict the 3D segmentation mask of the referred object. To enable standardized evaluation of this task, we introduce POINTR3D, a curated dataset of over 65,000 frames captured with three cameras across four indoor scenes, featuring diverse pointing scenarios. Each frame is annotated with the active hand, the ID of the referred object, and the object's 3D segmentation mask. To showcase an application of the proposed dataset, we further introduce Pointing3D, a transformer-based architecture that predicts the pointing direction from RGB images and uses this prediction as a prompt to segment the referred object in the 3D point cloud. Experimental results show that Pointing3D outperforms the strong baselines we introduce and lays the groundwork for future research. The dataset, source code, and evaluation tools will be made publicly available to support further research in this area, enabling natural human-machine interaction.
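
To make the task definition concrete, the Python sketch below implements a purely geometric toy baseline; it is not the Pointing3D architecture and not part of the released code. Assuming the 3D pointing ray (hand position and pointing direction) were already known, it simply labels every scene point that falls inside a narrow cone around that ray. The function name, arguments, and cone threshold are illustrative assumptions.

import numpy as np

def segment_by_pointing_ray(points: np.ndarray,
                            origin: np.ndarray,
                            direction: np.ndarray,
                            cone_half_angle_deg: float = 5.0) -> np.ndarray:
    """Toy geometric baseline (illustration only): mark every point of the
    scene cloud that lies inside a narrow cone around the pointing ray.

    points:    (N, 3) scene points
    origin:    (3,) ray origin, e.g. the pointing hand position
    direction: (3,) pointing direction (need not be normalized)
    Returns a boolean mask of shape (N,).
    """
    d = direction / np.linalg.norm(direction)   # unit pointing direction
    v = points - origin                         # vectors from hand to each point
    proj = v @ d                                # signed distance along the ray
    cos_angle = proj / (np.linalg.norm(v, axis=1) + 1e-9)
    in_front = proj > 0                         # keep only points ahead of the hand
    within_cone = cos_angle > np.cos(np.deg2rad(cone_half_angle_deg))
    return in_front & within_cone

# Example: two toy clusters; the ray from the origin points at the cluster near x = 2.
cloud = np.vstack([np.random.randn(100, 3) * 0.05 + np.array([2.0, 0.0, 0.0]),
                   np.random.randn(100, 3) * 0.05 + np.array([0.0, 2.0, 0.0])])
mask = segment_by_pointing_ray(cloud, origin=np.zeros(3), direction=np.array([1.0, 0.0, 0.0]))
print(int(mask.sum()), "of", len(cloud), "points selected")   # roughly the first cluster

The actual Pointing3D model removes both assumptions made here: it predicts the pointing direction from the RGB image and uses that prediction as a prompt to a learned 3D segmentation head, rather than applying a fixed geometric cone.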

Cite this Paper


BibTeX
@InProceedings{pmlr-v305-arslanoglu25a,
  title     = {Pointing3D: A Benchmark for 3D Object Referral via Pointing Gestures},
  author    = {Arslanoglu, Mert and Yilmaz, Kadir and \"{O}zaltan, Cemhan Kaan and Linder, Timm and Leibe, Bastian},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {2170--2183},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/arslanoglu25a/arslanoglu25a.pdf},
  url       = {https://proceedings.mlr.press/v305/arslanoglu25a.html},
  abstract  = {Pointing gestures provide a natural and efficient way to communicate spatial information in human-machine interaction, yet their potential for 3D object referral remains largely under-explored. To fill this gap, we introduce the task of pointing-based 3D segmentation. In this task, given an image of a person pointing at an object and the 3D point cloud of the environment, the goal is to predict the 3D segmentation mask of the referred object. To enable the standardized evaluation of this task, we introduce POINTR3D, a curated dataset of over 65,000 frames captured with three cameras across four indoor scenes, featuring diverse pointing scenarios. Each frame is annotated with the information of the active hand, the corresponding object ID, and the 3D segmentation mask of the object. To showcase the application of the proposed dataset, we further introduce Pointing3D, a transformer-based architecture that predicts the pointing direction from RGB images and uses this prediction as a prompt to segment the referred object in the 3D point cloud. Experimental results show that Pointing3D outperforms other strong baselines we introduce and lays the groundwork for future research. The dataset, source code, and evaluation tools will be made publicly available to support further research in this area, enabling a natural human-machine interaction.}
}
Endnote
%0 Conference Paper
%T Pointing3D: A Benchmark for 3D Object Referral via Pointing Gestures
%A Mert Arslanoglu
%A Kadir Yilmaz
%A Cemhan Kaan Özaltan
%A Timm Linder
%A Bastian Leibe
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-arslanoglu25a
%I PMLR
%P 2170--2183
%U https://proceedings.mlr.press/v305/arslanoglu25a.html
%V 305
%X Pointing gestures provide a natural and efficient way to communicate spatial information in human-machine interaction, yet their potential for 3D object referral remains largely under-explored. To fill this gap, we introduce the task of pointing-based 3D segmentation. In this task, given an image of a person pointing at an object and the 3D point cloud of the environment, the goal is to predict the 3D segmentation mask of the referred object. To enable the standardized evaluation of this task, we introduce POINTR3D, a curated dataset of over 65,000 frames captured with three cameras across four indoor scenes, featuring diverse pointing scenarios. Each frame is annotated with the information of the active hand, the corresponding object ID, and the 3D segmentation mask of the object. To showcase the application of the proposed dataset, we further introduce Pointing3D, a transformer-based architecture that predicts the pointing direction from RGB images and uses this prediction as a prompt to segment the referred object in the 3D point cloud. Experimental results show that Pointing3D outperforms other strong baselines we introduce and lays the groundwork for future research. The dataset, source code, and evaluation tools will be made publicly available to support further research in this area, enabling a natural human-machine interaction.
APA
Arslanoglu, M., Yilmaz, K., Özaltan, C. K., Linder, T., & Leibe, B. (2025). Pointing3D: A Benchmark for 3D Object Referral via Pointing Gestures. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:2170-2183. Available from https://proceedings.mlr.press/v305/arslanoglu25a.html.
