Evaluating Foundation Models’ 3D Understanding Through Multi-View Correspondence Analysis

Valentina Lilova, Toyesh Chakravorty, Julian I. Bibo, Emma Boccaletti, Brandon Li, Lívia Baxová, Cees G. M. Snoek, Mohammadreza Salehi
Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, PMLR 322:290-316, 2026.

Abstract

This paper extends the Hummingbird framework with the Multi-View ImageNet (MVImgNet) dataset to evaluate how foundation model image encoders handle in-context object segmentation under unseen camera angles. We group MVImgNet object views and construct memory banks from selected viewpoints, assessing generalization by evaluating performance on held-out angles. In addition to seven pretrained Vision Transformer (ViT) models (CLIP, DINO, DINOv2, DINOv3, SigLIP2, C-RADIOv2, and TIPS), we include VGGT, a geometry-grounded ViT model trained for multi-view 3D scene understanding. Our results show that DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require a dedicated multi-view implementation to properly reveal their geometric reasoning capabilities. These findings highlight the benefits of contrastive pretraining for robust performance across large viewpoint shifts. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval.
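The evaluation the abstract describes follows the Hummingbird in-context protocol: patch features from selected memory-bank viewpoints are stored together with their labels, and each patch of a held-out-angle query is labeled by retrieving its nearest neighbors from the bank. Below is a minimal PyTorch sketch of that retrieval step. The encoder interface, patch-aligned masks, and the simple majority vote (standing in for Hummingbird's softmax-weighted label aggregation) are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def build_memory_bank(encoder, views, patch_masks):
        # Store L2-normalized patch features and their per-patch class ids
        # from the memory-bank viewpoints. `patch_masks` are assumed to be
        # segmentation masks already downsampled to the encoder's patch grid.
        feats, labels = [], []
        for img, mask in zip(views, patch_masks):
            f = encoder(img.unsqueeze(0)).squeeze(0)   # (num_patches, dim)
            feats.append(F.normalize(f, dim=-1))
            labels.append(mask.flatten())              # (num_patches,)
        return torch.cat(feats), torch.cat(labels)

    @torch.no_grad()
    def segment_query(encoder, query, bank_feats, bank_labels, k=30):
        # Label every query patch by cosine-similarity k-NN retrieval
        # against the memory bank; a majority vote over neighbor labels
        # stands in for softmax-weighted aggregation.
        q = F.normalize(encoder(query.unsqueeze(0)).squeeze(0), dim=-1)
        sim = q @ bank_feats.T                         # (num_patches, bank_size)
        nn_idx = sim.topk(k, dim=-1).indices           # k nearest bank entries
        votes = bank_labels[nn_idx]                    # (num_patches, k)
        return votes.mode(dim=-1).values               # predicted class per patch

Held-out-angle generalization can then be measured by building the bank from one subset of an object's viewpoints and scoring the per-patch predictions (e.g., with mIoU) on the remaining angles.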

Cite this Paper


BibTeX
@InProceedings{pmlr-v322-lilova26a,
  title     = {Evaluating Foundation Models’ 3D Understanding Through Multi-View Correspondence Analysis},
  author    = {Lilova, Valentina and Chakravorty, Toyesh and Bibo, Julian I. and Boccaletti, Emma and Li, Brandon and Baxov\'{a}, L\'{i}via and Snoek, Cees G. M. and Salehi, Mohammadreza},
  booktitle = {Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models},
  pages     = {290--316},
  year      = {2026},
  editor    = {Fumero, Marco and Domine, Clementine and L\"{a}hner, Zorah and Cannistraci, Irene and Zhao, Bo and Williams, Alex},
  volume    = {322},
  series    = {Proceedings of Machine Learning Research},
  month     = {06 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v322/main/assets/lilova26a/lilova26a.pdf},
  url       = {https://proceedings.mlr.press/v322/lilova26a.html},
  abstract  = {This paper extends the Hummingbird framework with the Multi-View ImageNet (MVImgNet) dataset to evaluate how foundation model image encoders handle in-context object segmentation under unseen camera angles. We group MVImgNet object views and construct memory banks from selected viewpoints, assessing generalization by evaluating performance on held-out angles. In addition to seven pretrained Vision Transformer (ViT) models (CLIP, DINO, DINOv2, DINOv3, SigLIP2, C-RADIOv2, and TIPS), we include VGGT, a geometry-grounded ViT model trained for multi-view 3D scene understanding. Our results show that DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require a dedicated multi-view implementation to properly reveal their geometric reasoning capabilities. These findings highlight the benefits of contrastive pretraining for robust performance across large viewpoint shifts. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval.}
}
Endnote
%0 Conference Paper
%T Evaluating Foundation Models’ 3D Understanding Through Multi-View Correspondence Analysis
%A Valentina Lilova
%A Toyesh Chakravorty
%A Julian I. Bibo
%A Emma Boccaletti
%A Brandon Li
%A Lívia Baxová
%A Cees G. M. Snoek
%A Mohammadreza Salehi
%B Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models
%C Proceedings of Machine Learning Research
%D 2026
%E Marco Fumero
%E Clementine Domine
%E Zorah Lähner
%E Irene Cannistraci
%E Bo Zhao
%E Alex Williams
%F pmlr-v322-lilova26a
%I PMLR
%P 290--316
%U https://proceedings.mlr.press/v322/lilova26a.html
%V 322
%X This paper extends the Hummingbird framework with the Multi-View ImageNet (MVImgNet) dataset to evaluate how foundation model image encoders handle in-context object segmentation under unseen camera angles. We group MVImgNet object views and construct memory banks from selected viewpoints, assessing generalization by evaluating performance on held-out angles. In addition to seven pretrained Vision Transformer (ViT) models (CLIP, DINO, DINOv2, DINOv3, SigLIP2, C-RADIOv2, and TIPS), we include VGGT, a geometry-grounded ViT model trained for multi-view 3D scene understanding. Our results show that DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require a dedicated multi-view implementation to properly reveal their geometric reasoning capabilities. These findings highlight the benefits of contrastive pretraining for robust performance across large viewpoint shifts. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval.
APA
Lilova, V., Chakravorty, T., Bibo, J.I., Boccaletti, E., Li, B., Baxová, L., Snoek, C.G.M. & Salehi, M. (2026). Evaluating Foundation Models’ 3D Understanding Through Multi-View Correspondence Analysis. Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 322:290-316. Available from https://proceedings.mlr.press/v322/lilova26a.html.