Pic@Point: Cross-Modal Learning by Local and Global Point-Picture Correspondence

Vencia Herzog, Stefan Suwelack
Proceedings of the 16th Asian Conference on Machine Learning, PMLR 260:703-718, 2025.

Abstract

Self-supervised pre-training has achieved remarkable success in NLP and 2D vision. However, these advances have yet to translate to 3D data. Techniques like masked reconstruction face inherent challenges on unstructured point clouds, while many contrastive learning tasks lack complexity and informative value. In this paper, we present Pic@Point, an effective contrastive learning method based on structural 2D-3D correspondences. We leverage image cues rich in semantic and contextual knowledge to provide a guiding signal for point cloud representations at various abstraction levels. Our lightweight approach outperforms state-of-the-art pre-training methods on several 3D benchmarks.
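The abstract describes contrastively aligning point-cloud features with paired image features. As a rough illustration only (not the authors' exact formulation), a symmetric InfoNCE objective over a batch of paired point/image embeddings might look like the sketch below; the function name `info_nce`, the `temperature` value, and the batch-level pairing are assumptions for illustration.

```python
import numpy as np

def info_nce(point_feats, image_feats, temperature=0.07):
    """Symmetric InfoNCE loss between paired point and image embeddings.

    point_feats, image_feats: (N, D) arrays; row i of each array is assumed
    to come from the same object (a positive pair), all other rows are
    treated as negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = p @ v.T / temperature  # (N, N) similarity matrix

    def ce_diag(l):
        # Cross-entropy with the diagonal (the true pairing) as targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the point-to-image and image-to-point directions.
    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))
```

With perfectly aligned embeddings the loss approaches zero; with unrelated embeddings it stays near the log of the batch size, which is what makes it usable as a pre-training signal.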

Cite this Paper


BibTeX
@InProceedings{pmlr-v260-herzog25a,
  title     = {{Pic@Point}: {C}ross-Modal Learning by Local and Global Point-Picture Correspondence},
  author    = {Herzog, Vencia and Suwelack, Stefan},
  booktitle = {Proceedings of the 16th Asian Conference on Machine Learning},
  pages     = {703--718},
  year      = {2025},
  editor    = {Nguyen, Vu and Lin, Hsuan-Tien},
  volume    = {260},
  series    = {Proceedings of Machine Learning Research},
  month     = {05--08 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v260/main/assets/herzog25a/herzog25a.pdf},
  url       = {https://proceedings.mlr.press/v260/herzog25a.html},
  abstract  = {Self-supervised pre-training has achieved remarkable success in NLP and 2D vision. However, these advances have yet to translate to 3D data. Techniques like masked reconstruction face inherent challenges on unstructured point clouds, while many contrastive learning tasks lack in complexity and informative value. In this paper, we present Pic@Point, an effective contrastive learning method based on structural 2D-3D correspondences. We leverage image cues rich in semantic and contextual knowledge to provide a guiding signal for point cloud representations at various abstraction levels. Our lightweight approach outperforms state-of-the-art pre-training methods on several 3D benchmarks.}
}
Endnote
%0 Conference Paper
%T Pic@Point: Cross-Modal Learning by Local and Global Point-Picture Correspondence
%A Vencia Herzog
%A Stefan Suwelack
%B Proceedings of the 16th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Vu Nguyen
%E Hsuan-Tien Lin
%F pmlr-v260-herzog25a
%I PMLR
%P 703--718
%U https://proceedings.mlr.press/v260/herzog25a.html
%V 260
%X Self-supervised pre-training has achieved remarkable success in NLP and 2D vision. However, these advances have yet to translate to 3D data. Techniques like masked reconstruction face inherent challenges on unstructured point clouds, while many contrastive learning tasks lack in complexity and informative value. In this paper, we present Pic@Point, an effective contrastive learning method based on structural 2D-3D correspondences. We leverage image cues rich in semantic and contextual knowledge to provide a guiding signal for point cloud representations at various abstraction levels. Our lightweight approach outperforms state-of-the-art pre-training methods on several 3D benchmarks.
APA
Herzog, V. & Suwelack, S. (2025). Pic@Point: Cross-Modal Learning by Local and Global Point-Picture Correspondence. Proceedings of the 16th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 260:703-718. Available from https://proceedings.mlr.press/v260/herzog25a.html.