Test-Time Canonicalization by Foundation Models for Robust Perception

Utkarsh Singhal, Ryan Feng, Stella X. Yu, Atul Prakash
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:55788-55809, 2025.

Abstract

Perception in the real world requires robustness to diverse viewing conditions. Existing approaches often rely on specialized architectures or training with predefined data augmentations, limiting adaptability. Taking inspiration from mental rotation in human vision, we propose FoCal, a test-time robustness framework that transforms the input into the most typical view. At inference time, FoCal explores a set of transformed images and chooses the one with the highest likelihood under foundation model priors. This test-time optimization boosts robustness while requiring no retraining or architectural changes. Applied to models like CLIP and SAM, it significantly boosts robustness across a wide range of transformations, including 2D and 3D rotations, contrast and lighting shifts, and day-night changes. We also explore potential applications in active vision. By reframing invariance as a test-time optimization problem, FoCal offers a general and scalable approach to robustness. Our code is available at: https://github.com/sutkarsh/focal .

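To make the procedure described in the abstract concrete, here is a minimal illustrative sketch (not the paper's actual implementation): enumerate candidate transformations of the input (2D rotations in this toy example), score each candidate with a foundation-model "typicality" prior, and hand the highest-scoring view to a frozen downstream model. The names canonicalize and typicality_score are hypothetical placeholders; the real candidate sets, priors, and models are in the official repository linked above.

    # Illustrative sketch of test-time canonicalization, assuming a
    # user-supplied typicality_score (a stand-in for the paper's
    # foundation-model prior). Candidate set here is 2D rotations only;
    # the paper also covers 3D viewpoint, contrast/lighting, and
    # day-night changes.
    from typing import Callable, Iterable
    import torch
    import torchvision.transforms.functional as TF

    def canonicalize(image: torch.Tensor,
                     angles: Iterable[float],
                     typicality_score: Callable[[torch.Tensor], float]) -> torch.Tensor:
        """Return the candidate view judged most 'typical' by the prior."""
        candidates = [TF.rotate(image, float(a)) for a in angles]
        scores = [typicality_score(view) for view in candidates]
        best = max(range(len(candidates)), key=lambda i: scores[i])
        return candidates[best]

    # Usage (hypothetical): wrap any frozen model without retraining, e.g.
    # logits = frozen_model(canonicalize(x, range(0, 360, 15), prior))
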
Cite this Paper


BibTeX
@InProceedings{pmlr-v267-singhal25a,
  title     = {Test-Time Canonicalization by Foundation Models for Robust Perception},
  author    = {Singhal, Utkarsh and Feng, Ryan and Yu, Stella X. and Prakash, Atul},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {55788--55809},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/singhal25a/singhal25a.pdf},
  url       = {https://proceedings.mlr.press/v267/singhal25a.html},
  abstract  = {Perception in the real world requires robustness to diverse viewing conditions. Existing approaches often rely on specialized architectures or training with predefined data augmentations, limiting adaptability. Taking inspiration from mental rotation in human vision, we propose FoCal, a test-time robustness framework that transforms the input into the most typical view. At inference time, FoCal explores a set of transformed images and chooses the one with the highest likelihood under foundation model priors. This test-time optimization boosts robustness while requiring no retraining or architectural changes. Applied to models like CLIP and SAM, it significantly boosts robustness across a wide range of transformations, including 2D and 3D rotations, contrast and lighting shifts, and day-night changes. We also explore potential applications in active vision. By reframing invariance as a test-time optimization problem, FoCal offers a general and scalable approach to robustness. Our code is available at: https://github.com/sutkarsh/focal .}
}
Endnote
%0 Conference Paper
%T Test-Time Canonicalization by Foundation Models for Robust Perception
%A Utkarsh Singhal
%A Ryan Feng
%A Stella X. Yu
%A Atul Prakash
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-singhal25a
%I PMLR
%P 55788--55809
%U https://proceedings.mlr.press/v267/singhal25a.html
%V 267
%X Perception in the real world requires robustness to diverse viewing conditions. Existing approaches often rely on specialized architectures or training with predefined data augmentations, limiting adaptability. Taking inspiration from mental rotation in human vision, we propose FoCal, a test-time robustness framework that transforms the input into the most typical view. At inference time, FoCal explores a set of transformed images and chooses the one with the highest likelihood under foundation model priors. This test-time optimization boosts robustness while requiring no retraining or architectural changes. Applied to models like CLIP and SAM, it significantly boosts robustness across a wide range of transformations, including 2D and 3D rotations, contrast and lighting shifts, and day-night changes. We also explore potential applications in active vision. By reframing invariance as a test-time optimization problem, FoCal offers a general and scalable approach to robustness. Our code is available at: https://github.com/sutkarsh/focal .
APA
Singhal, U., Feng, R., Yu, S.X. & Prakash, A. (2025). Test-Time Canonicalization by Foundation Models for Robust Perception. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:55788-55809. Available from https://proceedings.mlr.press/v267/singhal25a.html.
