Evaluating Foundation Models’ 3D Understanding Through Multi-View Correspondence Analysis

Valentina Lilova, Toyesh Chakravorty, Julian I. Bibo, Emma Boccaletti, Brandon Li, Lívia Baxová, Cees G. M. Snoek, Mohammadreza Salehi
Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, PMLR 322:290-316, 2026.

Abstract

This paper extends the Hummingbird framework with the Multi-View ImageNet (MVImgNet) dataset to evaluate how foundation model image encoders handle in-context object segmentation under unseen camera angles. We group MVImgNet object views and construct memory banks from selected viewpoints, assessing generalization by evaluating performance on held-out angles. In addition to seven pretrained Vision Transformer (ViT) models (CLIP, DINO, DINOv2, DINOv3, SigLIP2, C-RADIOv2, and TIPS), we include VGGT, a geometry-grounded ViT model trained for multi-view 3D scene understanding. Our results show that DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require a dedicated multi-view implementation to properly reveal their geometric reasoning capabilities. These findings highlight the benefits of contrastive pretraining for robust performance across large viewpoint shifts. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval.
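The evaluation the abstract describes follows the Hummingbird in-context protocol: patch features from selected memory-bank viewpoints are stored together with their labels, and each patch of a held-out-angle query is labeled by retrieving its nearest neighbors from the bank. Below is a minimal PyTorch sketch of that retrieval step. The encoder interface, patch-aligned masks, and the simple majority vote (standing in for Hummingbird's softmax-weighted label aggregation) are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def build_memory_bank(encoder, views, patch_masks):
        # Store L2-normalized patch features and their per-patch class ids
        # from the memory-bank viewpoints. `patch_masks` are assumed to be
        # segmentation masks already downsampled to the encoder's patch grid.
        feats, labels = [], []
        for img, mask in zip(views, patch_masks):
            f = encoder(img.unsqueeze(0)).squeeze(0)   # (num_patches, dim)
            feats.append(F.normalize(f, dim=-1))
            labels.append(mask.flatten())              # (num_patches,)
        return torch.cat(feats), torch.cat(labels)

    @torch.no_grad()
    def segment_query(encoder, query, bank_feats, bank_labels, k=30):
        # Label every query patch by cosine-similarity k-NN retrieval
        # against the memory bank; a majority vote over neighbor labels
        # stands in for softmax-weighted aggregation.
        q = F.normalize(encoder(query.unsqueeze(0)).squeeze(0), dim=-1)
        sim = q @ bank_feats.T                         # (num_patches, bank_size)
        nn_idx = sim.topk(k, dim=-1).indices           # k nearest bank entries
        votes = bank_labels[nn_idx]                    # (num_patches, k)
        return votes.mode(dim=-1).values               # predicted class per patch

Held-out-angle generalization can then be measured by building the bank from one subset of an object's viewpoints and scoring the per-patch predictions (e.g., with mIoU) on the remaining angles.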

Cite this Paper


BibTeX
@InProceedings{pmlr-v322-lilova26a,
  title     = {Evaluating Foundation Models’ 3D Understanding Through Multi-View Correspondence Analysis},
  author    = {Lilova, Valentina and Chakravorty, Toyesh and Bibo, Julian I. and Boccaletti, Emma and Li, Brandon and Baxov\'{a}, L\'{i}via and Snoek, Cees G. M. and Salehi, Mohammadreza},
  booktitle = {Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models},
  pages     = {290--316},
  year      = {2026},
  editor    = {Fumero, Marco and Domine, Clementine and L\"{a}hner, Zorah and Cannistraci, Irene and Zhao, Bo and Williams, Alex},
  volume    = {322},
  series    = {Proceedings of Machine Learning Research},
  month     = {06 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v322/main/assets/lilova26a/lilova26a.pdf},
  url       = {https://proceedings.mlr.press/v322/lilova26a.html},
  abstract  = {This paper extends the Hummingbird framework with the Multi-View ImageNet (MVImgNet) dataset to evaluate how foundation model image encoders handle in-context object segmentation under unseen camera angles. We group MVImgNet object views and construct memory banks from selected viewpoints, assessing generalization by evaluating performance on held-out angles. In addition to seven pretrained Vision Transformer (ViT) models (CLIP, DINO, DINOv2, DINOv3, SigLIP2, C-RADIOv2, and TIPS), we include VGGT, a geometry-grounded ViT model trained for multi-view 3D scene understanding. Our results show that DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require a dedicated multi-view implementation to properly reveal their geometric reasoning capabilities. These findings highlight the benefits of contrastive pretraining for robust performance across large viewpoint shifts. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval.}
}
Endnote
%0 Conference Paper
%T Evaluating Foundation Models’ 3D Understanding Through Multi-View Correspondence Analysis
%A Valentina Lilova
%A Toyesh Chakravorty
%A Julian I. Bibo
%A Emma Boccaletti
%A Brandon Li
%A Lívia Baxová
%A Cees G. M. Snoek
%A Mohammadreza Salehi
%B Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models
%C Proceedings of Machine Learning Research
%D 2026
%E Marco Fumero
%E Clementine Domine
%E Zorah Lähner
%E Irene Cannistraci
%E Bo Zhao
%E Alex Williams
%F pmlr-v322-lilova26a
%I PMLR
%P 290--316
%U https://proceedings.mlr.press/v322/lilova26a.html
%V 322
%X This paper extends the Hummingbird framework with the Multi-View ImageNet (MVImgNet) dataset to evaluate how foundation model image encoders handle in-context object segmentation under unseen camera angles. We group MVImgNet object views and construct memory banks from selected viewpoints, assessing generalization by evaluating performance on held-out angles. In addition to seven pretrained Vision Transformer (ViT) models (CLIP, DINO, DINOv2, DINOv3, SigLIP2, C-RADIOv2, and TIPS), we include VGGT, a geometry-grounded ViT model trained for multi-view 3D scene understanding. Our results show that DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require a dedicated multi-view implementation to properly reveal their geometric reasoning capabilities. These findings highlight the benefits of contrastive pretraining for robust performance across large viewpoint shifts. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval.
APA
Lilova, V., Chakravorty, T., Bibo, J.I., Boccaletti, E., Li, B., Baxová, L., Snoek, C.G.M. & Salehi, M. (2026). Evaluating Foundation Models’ 3D Understanding Through Multi-View Correspondence Analysis. Proceedings of UniReps: the Third Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 322:290-316. Available from https://proceedings.mlr.press/v322/lilova26a.html.