FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection

Anqi Joyce Yang, James Tu, Nikita Dvornik, Enxu Li, Raquel Urtasun
Proceedings of The 9th Conference on Robot Learning, PMLR 305:5526-5556, 2025.

Abstract

In order to navigate complex traffic environments, self-driving vehicles must recognize many semantic classes pertaining to vulnerable road users or traffic control devices. However, many safety-critical objects (e.g., construction workers) appear infrequently in nominal traffic conditions, leading to a severe shortage of training examples from driving data alone. Recent vision foundation models, which are trained on a large corpus of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm that first generates proposals with a LiDAR-based branch and a novel camera-based branch, then refines them with attention, especially to image features from OWLv2. Evaluations on real-world driving data show that rich priors from vision foundation models, combined with careful multimodal fusion, lead to large gains for long-tailed 3D detection.
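To make the two-stage paradigm concrete, the following is a minimal PyTorch sketch of its two key ingredients: lifting 2D open-vocabulary detections into 3D with metric depth to form camera-branch proposals, and refining proposals by cross-attending to image features. This is an illustration, not the authors' implementation; every function, module, dimension, and interface below is an assumption.

# A minimal PyTorch sketch of the two-stage paradigm described in the abstract;
# this is NOT the authors' implementation. All names, dimensions, and interfaces
# are illustrative assumptions.
import torch
import torch.nn as nn

def lift_boxes_to_3d(boxes_2d, depth_map, intrinsics):
    """Lift 2D detection box centers (e.g., from an open-vocabulary detector
    such as OWLv2) into 3D points using a metric depth map (e.g., predicted
    by a model such as Metric3Dv2).
    boxes_2d: (N, 4) pixel xyxy; depth_map: (H, W); intrinsics: (3, 3)."""
    cx = (boxes_2d[:, 0] + boxes_2d[:, 2]) / 2
    cy = (boxes_2d[:, 1] + boxes_2d[:, 3]) / 2
    z = depth_map[cy.long(), cx.long()]        # metric depth at box centers
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    px, py = intrinsics[0, 2], intrinsics[1, 2]
    x = (cx - px) * z / fx                     # pinhole back-projection
    y = (cy - py) * z / fy
    return torch.stack([x, y, z], dim=-1)      # (N, 3) camera-frame centers

class ProposalRefiner(nn.Module):
    """Second stage: refine pooled proposal features (from the LiDAR and
    camera branches) by cross-attending to frozen image features."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, 8)  # 7 box residuals (x, y, z, l, w, h, yaw) + score

    def forward(self, proposal_feats, image_feats):
        # proposal_feats: (B, N, dim) per-proposal queries
        # image_feats:    (B, M, dim), e.g., frozen OWLv2 patch tokens as keys/values
        refined, _ = self.cross_attn(proposal_feats, image_feats, image_feats)
        return self.head(refined)              # (B, N, 8) residuals + scores

Keeping the foundation-model features frozen as attention keys and values, as sketched here, is one plausible way to preserve their semantic priors while training only a lightweight refinement head on driving data.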

Cite this Paper

BibTeX
@InProceedings{pmlr-v305-yang25e,
  title     = {FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection},
  author    = {Yang, Anqi Joyce and Tu, James and Dvornik, Nikita and Li, Enxu and Urtasun, Raquel},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages     = {5526--5556},
  year      = {2025},
  editor    = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume    = {305},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/yang25e/yang25e.pdf},
  url       = {https://proceedings.mlr.press/v305/yang25e.html},
  abstract  = {In order to navigate complex traffic environments, self-driving vehicles must recognize many semantic classes pertaining to vulnerable road users or traffic control devices. However, many safety-critical objects (e.g., construction workers) appear infrequently in nominal traffic conditions, leading to a severe shortage of training examples from driving data alone. Recent vision foundation models, which are trained on a large corpus of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm that first generates proposals with a LiDAR-based branch and a novel camera-based branch, then refines them with attention, especially to image features from OWLv2. Evaluations on real-world driving data show that rich priors from vision foundation models, combined with careful multimodal fusion, lead to large gains for long-tailed 3D detection.}
}
Endnote
%0 Conference Paper
%T FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection
%A Anqi Joyce Yang
%A James Tu
%A Nikita Dvornik
%A Enxu Li
%A Raquel Urtasun
%B Proceedings of The 9th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Joseph Lim
%E Shuran Song
%E Hae-Won Park
%F pmlr-v305-yang25e
%I PMLR
%P 5526--5556
%U https://proceedings.mlr.press/v305/yang25e.html
%V 305
%X In order to navigate complex traffic environments, self-driving vehicles must recognize many semantic classes pertaining to vulnerable road users or traffic control devices. However, many safety-critical objects (e.g., construction workers) appear infrequently in nominal traffic conditions, leading to a severe shortage of training examples from driving data alone. Recent vision foundation models, which are trained on a large corpus of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm that first generates proposals with a LiDAR-based branch and a novel camera-based branch, then refines them with attention, especially to image features from OWLv2. Evaluations on real-world driving data show that rich priors from vision foundation models, combined with careful multimodal fusion, lead to large gains for long-tailed 3D detection.
APA
Yang, A.J., Tu, J., Dvornik, N., Li, E. & Urtasun, R. (2025). FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection. Proceedings of The 9th Conference on Robot Learning, in Proceedings of Machine Learning Research 305:5526-5556. Available from https://proceedings.mlr.press/v305/yang25e.html.
