From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:6505-6521, 2025.
Abstract
3D vision-language grounding faces a fundamental data bottleneck: while 2D models train on billions of images, 3D models have access to only thousands of labeled scenes, a six-order-of-magnitude gap that severely limits performance. We introduce LIFT-GS, a practical distillation technique that overcomes this limitation by using differentiable rendering to bridge 3D and 2D supervision. LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views, enabling supervision from 2D foundation models (SAM, CLIP, LLaMA) without requiring any 3D annotations. This render-supervised formulation enables end-to-end training of complete encoder-decoder architectures and is inherently model-agnostic. LIFT-GS achieves state-of-the-art results with 25.7% mAP on open-vocabulary instance segmentation (vs. 20.2% prior SOTA) and consistent 10-30% improvements on referential grounding tasks. Remarkably, pretraining effectively multiplies fine-tuning datasets by 2×, demonstrating strong scaling properties that suggest 3D VLG currently operates in a severely data-scarce regime. Project page: https://liftgs.github.io.
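
To make the render-supervised idea concrete, here is a minimal PyTorch-style sketch of one training step: a 3D model predicts a language-conditioned mask over a point cloud, the mask is pushed through a differentiable rendering step into a 2D view, and the loss is computed against a 2D pseudo-label from a foundation model. All names here (PointEncoder, MaskDecoder, splat_to_image) are hypothetical placeholders, and the toy splatting function stands in for the paper's actual 3D Gaussian prediction and rasterization; this is an illustration of the supervision flow, not the authors' implementation.

```python
# Sketch of render-supervised distillation (assumed, simplified stand-ins throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointEncoder(nn.Module):
    """Hypothetical stand-in for the 3D backbone over a point cloud."""
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, xyz):                      # xyz: (N, 3)
        return self.mlp(xyz)                     # per-point features: (N, dim)

class MaskDecoder(nn.Module):
    """Hypothetical language-conditioned head producing per-point mask logits."""
    def __init__(self, dim=64, text_dim=64):
        super().__init__()
        self.proj = nn.Linear(text_dim, dim)
    def forward(self, feats, text_emb):          # feats: (N, dim), text_emb: (text_dim,)
        return feats @ self.proj(text_emb)       # per-point mask logits: (N,)

def splat_to_image(xyz, logits, H=64, W=64):
    """Toy differentiable 'renderer': average point mask values onto a 2D grid.
    The actual method rasterizes predicted 3D Gaussians into the camera view."""
    u = ((xyz[:, 0] + 1) / 2 * (W - 1)).clamp(0, W - 1).long()
    v = ((xyz[:, 1] + 1) / 2 * (H - 1)).clamp(0, H - 1).long()
    img = torch.zeros(H, W).index_put((v, u), torch.sigmoid(logits), accumulate=True)
    wgt = torch.zeros(H, W).index_put((v, u), torch.ones_like(logits), accumulate=True)
    return img / wgt.clamp(min=1.0)              # predicted 2D mask: (H, W)

encoder, decoder = PointEncoder(), MaskDecoder()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

xyz = torch.rand(1024, 3) * 2 - 1                # point cloud in [-1, 1]^3
text_emb = torch.randn(64)                       # e.g. a text embedding from CLIP
target_2d = (torch.rand(64, 64) > 0.5).float()   # stand-in for a SAM mask in this view

logits_3d = decoder(encoder(xyz), text_emb)      # language-conditioned 3D mask logits
rendered = splat_to_image(xyz, logits_3d)        # differentiable render into the 2D view
loss = F.binary_cross_entropy(rendered.clamp(1e-6, 1 - 1e-6), target_2d)
loss.backward()                                  # gradients flow back to the 3D predictor
opt.step()
```

The key property the sketch illustrates is that the only labels consumed are 2D (the target mask), yet the gradient of the rendering loss updates the 3D encoder-decoder end to end, which is what lets 2D foundation models supervise 3D grounding without 3D annotations.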