Testing the Limits of Fine-Tuning for Improving Visual Cognition in Vision Language Models

Luca M. Schulze Buschoff, Konstantinos Voudouris, Elif Akata, Matthias Bethge, Joshua B. Tenenbaum, Eric Schulz
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:53645-53662, 2025.

Abstract

Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains under a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that task-specific fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-schulze-buschoff25a,
  title     = {Testing the Limits of Fine-Tuning for Improving Visual Cognition in Vision Language Models},
  author    = {Schulze Buschoff, Luca M. and Voudouris, Konstantinos and Akata, Elif and Bethge, Matthias and Tenenbaum, Joshua B. and Schulz, Eric},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {53645--53662},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/schulze-buschoff25a/schulze-buschoff25a.pdf},
  url       = {https://proceedings.mlr.press/v267/schulze-buschoff25a.html},
  abstract  = {Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains under a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that task-specific fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.}
}
Endnote
%0 Conference Paper
%T Testing the Limits of Fine-Tuning for Improving Visual Cognition in Vision Language Models
%A Luca M. Schulze Buschoff
%A Konstantinos Voudouris
%A Elif Akata
%A Matthias Bethge
%A Joshua B. Tenenbaum
%A Eric Schulz
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-schulze-buschoff25a
%I PMLR
%P 53645--53662
%U https://proceedings.mlr.press/v267/schulze-buschoff25a.html
%V 267
%X Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains under a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that task-specific fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.
APA
Schulze Buschoff, L.M., Voudouris, K., Akata, E., Bethge, M., Tenenbaum, J.B., & Schulz, E. (2025). Testing the Limits of Fine-Tuning for Improving Visual Cognition in Vision Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:53645-53662. Available from https://proceedings.mlr.press/v267/schulze-buschoff25a.html.