Theia: Distilling Diverse Vision Foundation Models for Robot Learning

Jinghuan Shang, Karl Schmeckpeper, Brandon B. May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, Laura Herlant
Proceedings of The 8th Conference on Robot Learning, PMLR 270:724-748, 2025.

Abstract

Vision-based robot policy learning, which maps visual inputs to actions, necessitates a holistic understanding of diverse visual tasks beyond single-task needs like classification or segmentation. Inspired by this, we introduce Theia, a vision foundation model for robot learning that distills multiple off-the-shelf vision foundation models trained on varied vision tasks. Theia’s rich visual representations encode diverse visual knowledge, enhancing downstream robot learning. Extensive experiments demonstrate that Theia outperforms its teacher models and prior robot learning models using less training data and smaller model sizes. Additionally, we quantify the quality of pre-trained visual representations and hypothesize that higher entropy in feature norm distributions leads to improved robot learning performance. Code, models, and demo are available at https://theia.theaiinstitute.com.
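
To make the two technical ideas in the abstract concrete, here is a minimal sketch in PyTorch. It is not the authors' implementation: the feature shapes, the names MultiTeacherDistiller and feature_norm_entropy, the per-teacher linear translation heads, the smooth-L1 distillation loss, and the histogram-based entropy estimate are all illustrative assumptions. It shows (1) regressing a student's spatial features onto several frozen teacher feature maps, and (2) estimating the entropy of the per-patch feature-norm distribution, the quantity the abstract ties to downstream robot learning performance.

    import torch
    import torch.nn as nn

    def feature_norm_entropy(features: torch.Tensor, num_bins: int = 64) -> torch.Tensor:
        # Entropy of the distribution of per-patch L2 feature norms.
        # `features` is assumed to have shape (batch, num_patches, dim).
        norms = features.flatten(0, 1).norm(dim=-1)          # (batch * patches,)
        hist = torch.histc(norms, bins=num_bins,
                           min=norms.min().item(), max=norms.max().item())
        p = hist / hist.sum()
        p = p[p > 0]                                         # drop empty bins
        return -(p * p.log()).sum()

    class MultiTeacherDistiller(nn.Module):
        # Regress one student representation onto several frozen teachers,
        # using a lightweight linear "translation" head per teacher.
        # All dimensions below are hypothetical.
        def __init__(self, student_dim: int, teacher_dims: dict[str, int]):
            super().__init__()
            self.heads = nn.ModuleDict(
                {name: nn.Linear(student_dim, dim)
                 for name, dim in teacher_dims.items()}
            )

        def forward(self, student_feats: torch.Tensor,
                    teacher_feats: dict[str, torch.Tensor]) -> torch.Tensor:
            # Sum of per-teacher regression losses between translated
            # student features and the frozen teacher features.
            loss = torch.zeros((), device=student_feats.device)
            for name, target in teacher_feats.items():
                pred = self.heads[name](student_feats)
                loss = loss + nn.functional.smooth_l1_loss(pred, target)
            return loss

    # Example with made-up shapes: a 768-dim student distilling a 1024-dim
    # and a 1536-dim teacher over 196 patch tokens.
    student = torch.randn(2, 196, 768)
    teachers = {"teacher_a": torch.randn(2, 196, 1024),
                "teacher_b": torch.randn(2, 196, 1536)}
    distiller = MultiTeacherDistiller(768, {"teacher_a": 1024, "teacher_b": 1536})
    print(distiller(student, teachers).item(), feature_norm_entropy(student).item())

Per the abstract's hypothesis, higher entropy of this norm distribution correlates with better downstream robot learning, which is why the diagnostic above is computed on the representation itself rather than on task outputs.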

Cite this Paper

BibTeX
@InProceedings{pmlr-v270-shang25a,
  title     = {Theia: Distilling Diverse Vision Foundation Models for Robot Learning},
  author    = {Shang, Jinghuan and Schmeckpeper, Karl and May, Brandon B. and Minniti, Maria Vittoria and Kelestemur, Tarik and Watkins, David and Herlant, Laura},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  pages     = {724--748},
  year      = {2025},
  editor    = {Agrawal, Pulkit and Kroemer, Oliver and Burgard, Wolfram},
  volume    = {270},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--09 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v270/main/assets/shang25a/shang25a.pdf},
  url       = {https://proceedings.mlr.press/v270/shang25a.html}
}
Endnote
%0 Conference Paper
%T Theia: Distilling Diverse Vision Foundation Models for Robot Learning
%A Jinghuan Shang
%A Karl Schmeckpeper
%A Brandon B. May
%A Maria Vittoria Minniti
%A Tarik Kelestemur
%A David Watkins
%A Laura Herlant
%B Proceedings of The 8th Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Pulkit Agrawal
%E Oliver Kroemer
%E Wolfram Burgard
%F pmlr-v270-shang25a
%I PMLR
%P 724--748
%U https://proceedings.mlr.press/v270/shang25a.html
%V 270
APA
Shang, J., Schmeckpeper, K., May, B. B., Minniti, M. V., Kelestemur, T., Watkins, D., & Herlant, L. (2025). Theia: Distilling Diverse Vision Foundation Models for Robot Learning. Proceedings of The 8th Conference on Robot Learning, in Proceedings of Machine Learning Research 270:724-748. Available from https://proceedings.mlr.press/v270/shang25a.html.
