Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection

Mehar Khurana, Neehar Peri, James Hays, Deva Ramanan
Proceedings of The 8th Conference on Robot Learning, PMLR 270:2080-2103, 2025.

Abstract

State-of-the-art 3D object detectors are often trained on massive labeled datasets. However, annotating 3D bounding boxes remains prohibitively expensive and time-consuming, particularly for LiDAR. Instead, recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels. Contemporary methods adapt best practices for self-supervised learning from the image domain to point clouds (such as contrastive learning). However, publicly available 3D datasets are considerably smaller and less diverse than those used for image-based self-supervised learning, limiting their effectiveness. We note, however, that such data is naturally collected in a multi-modal fashion, often paired with images. Rather than pre-training with only self-supervised objectives, we argue that it is better to bootstrap point cloud representations using image-based foundation models trained on internet-scale image data. Specifically, we propose a shelf-supervised approach (i.e., supervised with off-the-shelf image foundation models) for generating zero-shot 3D bounding boxes from paired RGB and LiDAR data. Pre-training 3D detectors with such pseudo-labels yields significantly better semi-supervised detection accuracy than prior self-supervised pretext tasks. Importantly, we show that image-based shelf-supervision is helpful for training both LiDAR-only and multi-modal (RGB + LiDAR) detectors. We demonstrate the effectiveness of our approach on nuScenes and the Waymo Open Dataset (WOD), significantly improving over prior work in limited-data settings.
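The recipe summarized above lends itself to a compact illustration: run an off-the-shelf image model (e.g. an open-vocabulary detector or segmenter) on the RGB frame, then lift each 2D instance onto the paired LiDAR sweep to obtain a zero-shot 3D box. The sketch below is a minimal, assumption-laden rendering of that lifting step, not the paper's implementation: the function name, the in-mask point selection, and the axis-aligned box fit are illustrative choices, and the full pipeline is described in the linked PDF.

    import numpy as np

    def lift_mask_to_3d_box(points_lidar, mask, T_cam_from_lidar, K):
        # Illustrative sketch (not the paper's code): lift a 2D instance mask
        # predicted by an off-the-shelf image model to a coarse 3D pseudo-label
        # using the paired LiDAR sweep and known camera calibration.
        #   points_lidar:     (N, 3) LiDAR points in the LiDAR frame
        #   mask:             (H, W) boolean instance mask on the RGB image
        #   T_cam_from_lidar: (4, 4) extrinsics, LiDAR frame -> camera frame
        #   K:                (3, 3) camera intrinsics
        # Transform points into the camera frame (homogeneous coordinates).
        pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
        pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
        pts_cam = pts_cam[pts_cam[:, 2] > 0.1]  # keep points in front of the camera

        # Project the surviving points into pixel coordinates.
        uv = (K @ pts_cam.T).T
        uv = uv[:, :2] / uv[:, 2:3]

        # Keep points whose projection lands inside the image and on the mask.
        H, W = mask.shape
        u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
        inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        on_mask = np.zeros(len(pts_cam), dtype=bool)
        on_mask[inside] = mask[v[inside], u[inside]]
        obj_pts = pts_cam[on_mask]
        if len(obj_pts) < 5:
            return None  # too few points for a reliable pseudo-label

        # Fit a coarse axis-aligned box (camera frame) around the object points.
        lo, hi = obj_pts.min(axis=0), obj_pts.max(axis=0)
        return {"center": (lo + hi) / 2, "size": hi - lo}

In practice a robust pipeline would also reject background points that leak through the mask's viewing frustum (e.g. by clustering point depths) and estimate an orientation; the resulting boxes then serve as pseudo-labels for pre-training a 3D detector before fine-tuning on the small labeled split.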

Cite this Paper


BibTeX
@InProceedings{pmlr-v270-khurana25a,
  title     = {Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection},
  author    = {Khurana, Mehar and Peri, Neehar and Hays, James and Ramanan, Deva},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {2080--2103},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/khurana25a/khurana25a.pdf},
  url       = {https://proceedings.mlr.press/v270/khurana25a.html},
  abstract  = {State-of-the-art 3D object detectors are often trained on massive labeled datasets. However, annotating 3D bounding boxes remains prohibitively expensive and time-consuming, particularly for LiDAR. Instead, recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels. Contemporary methods adapt best practices for self-supervised learning from the image domain to point clouds (such as contrastive learning). However, publicly available 3D datasets are considerably smaller and less diverse than those used for image-based self-supervised learning, limiting their effectiveness. We note, however, that such data is naturally collected in a multi-modal fashion, often paired with images. Rather than pre-training with only self-supervised objectives, we argue that it is better to bootstrap point cloud representations using image-based foundation models trained on internet-scale image data. Specifically, we propose a shelf-supervised approach (i.e., supervised with off-the-shelf image foundation models) for generating zero-shot 3D bounding boxes from paired RGB and LiDAR data. Pre-training 3D detectors with such pseudo-labels yields significantly better semi-supervised detection accuracy than prior self-supervised pretext tasks. Importantly, we show that image-based shelf-supervision is helpful for training both LiDAR-only and multi-modal (RGB + LiDAR) detectors. We demonstrate the effectiveness of our approach on nuScenes and the Waymo Open Dataset (WOD), significantly improving over prior work in limited-data settings.}
}
Endnote
%0 Conference Paper
%T Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection
%A Mehar Khurana
%A Neehar Peri
%A James Hays
%A Deva Ramanan
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-khurana25a
%I PMLR
%P 2080--2103
%U https://proceedings.mlr.press/v270/khurana25a.html
%V 270
%X State-of-the-art 3D object detectors are often trained on massive labeled datasets. However, annotating 3D bounding boxes remains prohibitively expensive and time-consuming, particularly for LiDAR. Instead, recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels. Contemporary methods adapt best practices for self-supervised learning from the image domain to point clouds (such as contrastive learning). However, publicly available 3D datasets are considerably smaller and less diverse than those used for image-based self-supervised learning, limiting their effectiveness. We note, however, that such data is naturally collected in a multi-modal fashion, often paired with images. Rather than pre-training with only self-supervised objectives, we argue that it is better to bootstrap point cloud representations using image-based foundation models trained on internet-scale image data. Specifically, we propose a shelf-supervised approach (i.e., supervised with off-the-shelf image foundation models) for generating zero-shot 3D bounding boxes from paired RGB and LiDAR data. Pre-training 3D detectors with such pseudo-labels yields significantly better semi-supervised detection accuracy than prior self-supervised pretext tasks. Importantly, we show that image-based shelf-supervision is helpful for training both LiDAR-only and multi-modal (RGB + LiDAR) detectors. We demonstrate the effectiveness of our approach on nuScenes and the Waymo Open Dataset (WOD), significantly improving over prior work in limited-data settings.
APA
Khurana, M., Peri, N., Hays, J. & Ramanan, D. (2025). Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:2080-2103. Available from https://proceedings.mlr.press/v270/khurana25a.html.
