LOCATE 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Paul Mcvay, Sergio Arnaud, Ada Martin, Arjun Majumdar, Krishna Murthy Jatavallabhula, Phillip Thomas, Ruslan Partsey, Daniel Dugas, Abha Gejji, Alexander Sax, Vincent-Pierre Berges, Mikael Henaff, Ayush Jain, Ang Cao, Ishita Prasad, Mrinal Kalakrishnan, Michael Rabbat, Nicolas Ballas, Mido Assran, Oleksandr Maksymets, Aravind Rajeswaran, Franziska Meier
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:43476-43502, 2025.

Abstract

We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model. Code, models and dataset can be found at the project website: locate3d.atmeta.com
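The abstract's two pretraining ingredients, lifting 2D foundation-model features onto the sensor point cloud and masked prediction in latent space, can be illustrated concretely. The snippet below is a minimal, hypothetical PyTorch sketch: `PointEncoder`, `jepa_step`, the feature sizes, the pooled-context predictor, and the EMA rate are all assumptions made for exposition, not the released LOCATE 3D / 3D-JEPA implementation.

```python
# Minimal sketch of a 3D-JEPA-style pretraining step (illustrative only; module
# names, sizes, masking, and EMA schedule are assumptions, not the paper's code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class PointEncoder(nn.Module):
    """Toy stand-in for the encoder: embeds per-point 2D-foundation-model features
    (e.g. lifted CLIP/DINO) plus xyz, then contextualizes them with self-attention."""
    def __init__(self, feat_dim=1024, dim=256, depth=4, heads=8):
        super().__init__()
        self.embed = nn.Linear(feat_dim + 3, dim)  # lifted 2D feature + xyz coords
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, xyz, feats):
        tokens = self.embed(torch.cat([feats, xyz], dim=-1))
        return self.blocks(tokens)  # [B, N, dim]


def jepa_step(context_enc, target_enc, predictor, xyz, feats, mask_ratio=0.5):
    """One masked-prediction-in-latent-space step: regress the target latents of
    masked points from the visible (context) points."""
    B, N, _ = feats.shape
    n_mask = int(mask_ratio * N)
    perm = torch.rand(B, N, device=feats.device).argsort(dim=1)
    masked_idx, keep_idx = perm[:, :n_mask], perm[:, n_mask:]

    # Targets come from the EMA target encoder applied to the full point cloud.
    with torch.no_grad():
        target = target_enc(xyz, feats)  # [B, N, dim]
        target = torch.gather(
            target, 1, masked_idx.unsqueeze(-1).expand(-1, -1, target.size(-1)))

    # The context encoder only sees the unmasked points.
    ctx_xyz = torch.gather(xyz, 1, keep_idx.unsqueeze(-1).expand(-1, -1, 3))
    ctx_feats = torch.gather(
        feats, 1, keep_idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
    ctx = context_enc(ctx_xyz, ctx_feats)  # [B, N - n_mask, dim]

    # Crude predictor: pooled context conditioned on each masked point's position.
    masked_xyz = torch.gather(xyz, 1, masked_idx.unsqueeze(-1).expand(-1, -1, 3))
    ctx_summary = ctx.mean(dim=1, keepdim=True).expand(-1, n_mask, -1)
    pred = predictor(torch.cat([ctx_summary, masked_xyz], dim=-1))
    return F.smooth_l1_loss(pred, target)


# Hypothetical usage with random tensors standing in for a featurized point cloud.
ctx_enc = PointEncoder()
tgt_enc = copy.deepcopy(ctx_enc).requires_grad_(False)
predictor = nn.Sequential(nn.Linear(256 + 3, 256), nn.GELU(), nn.Linear(256, 256))
xyz, feats = torch.randn(2, 1024, 3), torch.randn(2, 1024, 1024)
loss = jepa_step(ctx_enc, tgt_enc, predictor, xyz, feats)
loss.backward()
# After the optimizer step, the target encoder would be updated as an EMA copy, e.g.:
# for p_t, p_c in zip(tgt_enc.parameters(), ctx_enc.parameters()):
#     p_t.data.mul_(0.996).add_(0.004 * p_c.data)
```

The property this sketch is meant to convey is that the loss compares predicted and target latents rather than reconstructed inputs, with the target encoder kept as a slowly updated EMA copy of the context encoder; the contextualized features learned this way are what the abstract describes being finetuned with a language-conditioned decoder for masks and boxes.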

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-mcvay25a,
  title     = {{LOCATE} 3{D}: Real-World Object Localization via Self-Supervised Learning in 3{D}},
  author    = {Mcvay, Paul and Arnaud, Sergio and Martin, Ada and Majumdar, Arjun and Jatavallabhula, Krishna Murthy and Thomas, Phillip and Partsey, Ruslan and Dugas, Daniel and Gejji, Abha and Sax, Alexander and Berges, Vincent-Pierre and Henaff, Mikael and Jain, Ayush and Cao, Ang and Prasad, Ishita and Kalakrishnan, Mrinal and Rabbat, Michael and Ballas, Nicolas and Assran, Mido and Maksymets, Oleksandr and Rajeswaran, Aravind and Meier, Franziska},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {43476--43502},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/mcvay25a/mcvay25a.pdf},
  url       = {https://proceedings.mlr.press/v267/mcvay25a.html},
  abstract  = {We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model. Code, models and dataset can be found at the project website: locate3d.atmeta.com}
}
Endnote
%0 Conference Paper
%T LOCATE 3D: Real-World Object Localization via Self-Supervised Learning in 3D
%A Paul Mcvay
%A Sergio Arnaud
%A Ada Martin
%A Arjun Majumdar
%A Krishna Murthy Jatavallabhula
%A Phillip Thomas
%A Ruslan Partsey
%A Daniel Dugas
%A Abha Gejji
%A Alexander Sax
%A Vincent-Pierre Berges
%A Mikael Henaff
%A Ayush Jain
%A Ang Cao
%A Ishita Prasad
%A Mrinal Kalakrishnan
%A Michael Rabbat
%A Nicolas Ballas
%A Mido Assran
%A Oleksandr Maksymets
%A Aravind Rajeswaran
%A Franziska Meier
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-mcvay25a
%I PMLR
%P 43476--43502
%U https://proceedings.mlr.press/v267/mcvay25a.html
%V 267
%X We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model. Code, models and dataset can be found at the project website: locate3d.atmeta.com
APA
Mcvay, P., Arnaud, S., Martin, A., Majumdar, A., Jatavallabhula, K.M., Thomas, P., Partsey, R., Dugas, D., Gejji, A., Sax, A., Berges, V., Henaff, M., Jain, A., Cao, A., Prasad, I., Kalakrishnan, M., Rabbat, M., Ballas, N., Assran, M., Maksymets, O., Rajeswaran, A. & Meier, F. (2025). LOCATE 3D: Real-World Object Localization via Self-Supervised Learning in 3D. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:43476-43502. Available from https://proceedings.mlr.press/v267/mcvay25a.html.