A Balancing Act: Optimizing Classification and Retrieval in Cross-Modal Vision Models

Judith Lefkes; Clément Grisi; Geert Litjens

Proceedings of The 8th International Conference on Medical Imaging with Deep Learning, PMLR 301:914-930, 2026.

Abstract

Despite the promising capabilities of vision-language models (VLMs) in diverse tasks, recent studies reveal that they struggle with the fundamental task of image classification. In this study, we explore leveraging state-of-the-art task-specific classification models as a foundation for VLMs, aiming to preserve strong classification performance. Specifically, we assess the impact of contrastive tuning to enable cross-modal retrieval capabilities on a Vision Transformer (ViT) model trained for multi-label classification on natural images and a Hierarchical Vision Transformer (H-ViT) trained for prostate cancer grading in Whole-Slide Images (WSIs). Our results demonstrate that contrastive fine-tuning creates a clear trade-off: classification accuracy rapidly deteriorates toward zero as vision-text alignment improves. By balancing task-specific and contrastive objectives in the loss function during fine-tuning, we achieve competitive slide-level retrieval performance while maintaining classification accuracy. Our code is available on https://github.com/DIAGNijmegen/tradeoff_classification_alignment.git.
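The balancing of task-specific and contrastive objectives described in the abstract can be pictured as a weighted sum of two losses. The following is a minimal sketch under stated assumptions, not the authors' implementation: the weight `alpha`, the temperature value, the function names, the choice of binary cross-entropy for the multi-label case, and the CLIP-style symmetric contrastive term are all hypothetical choices made for illustration. The linked repository is the authoritative source for the actual formulation.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels (assumed multi-label objective)."""
    # Numerically stable form: max(x, 0) - x*y + log(1 + exp(-|x|))
    return float(np.mean(np.maximum(logits, 0) - logits * targets
                         + np.log1p(np.exp(-np.abs(logits)))))

def log_softmax(x, axis):
    """Log-softmax along one axis, shifted by the max for stability."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE; row i of each matrix is a matched pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T / temperature          # pairwise cosine similarities
    idx = np.arange(sim.shape[0])
    i2t = -log_softmax(sim, axis=1)[idx, idx].mean()  # image-to-text direction
    t2i = -log_softmax(sim, axis=0)[idx, idx].mean()  # text-to-image direction
    return float(0.5 * (i2t + t2i))

def balanced_loss(logits, targets, img_emb, txt_emb, alpha=0.5):
    """Hypothetical weighting: alpha=1 is classification only, alpha=0 contrastive only."""
    cls = bce_with_logits(logits, targets)
    con = symmetric_contrastive(img_emb, txt_emb)
    return alpha * cls + (1.0 - alpha) * con

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(4, 3))                      # 4 images, 3 labels
    targets = (rng.random((4, 3)) > 0.5).astype(float)
    img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
    for a in (0.0, 0.5, 1.0):
        print(a, balanced_loss(logits, targets, img, txt, alpha=a))
```

In this sketch, sweeping `alpha` between 0 and 1 moves the objective between pure vision-text alignment and pure classification, which is one simple way to explore the trade-off the abstract reports.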

Cite this Paper

BibTeX

@InProceedings{pmlr-v301-lefkes26a,
  title = 	 {A Balancing Act: Optimizing Classification and Retrieval in Cross-Modal Vision Models},
  author =       {Lefkes, Judith and Grisi, Cl{\'e}ment and Litjens, Geert},
  booktitle = 	 {Proceedings of The 8th International Conference on Medical Imaging with Deep Learning},
  pages = 	 {914--930},
  year = 	 {2026},
  editor = 	 {Tasdizen, Tolga and Elhabian, Shireen and Summers, Ronald and Chen, Chen and Koch, Lisa and Zhuang, Yan},
  volume = 	 {301},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {09--11 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v301/main/assets/lefkes26a/lefkes26a.pdf},
  url = 	 {https://proceedings.mlr.press/v301/lefkes26a.html},
  abstract = 	 {Despite the promising capabilities of vision-language models (VLMs) in diverse tasks, recent studies reveal that they struggle with the fundamental task of image classification. In this study, we explore leveraging state-of-the-art task-specific classification models as a foundation for VLMs, aiming to preserve strong classification performance. Specifically, we assess the impact of contrastive tuning to enable cross-modal retrieval capabilities on a Vision Transformer (ViT) model trained for multi-label classification on natural images and a Hierarchical Vision Transformer (H-ViT) trained for prostate cancer grading in Whole-Slide Images (WSIs). Our results demonstrate that contrastive fine-tuning creates a clear trade-off: classification accuracy rapidly deteriorates toward zero as vision-text alignment improves. By balancing task-specific and contrastive objectives in the loss function during fine-tuning, we achieve competitive slide-level retrieval performance while maintaining classification accuracy. Our code is available on https://github.com/DIAGNijmegen/tradeoff_classification_alignment.git.}
}

Endnote

%0 Conference Paper
%T A Balancing Act: Optimizing Classification and Retrieval in Cross-Modal Vision Models
%A Judith Lefkes
%A Clément Grisi
%A Geert Litjens
%B Proceedings of The 8th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Tolga Tasdizen
%E Shireen Elhabian
%E Ronald Summers
%E Chen Chen
%E Lisa Koch
%E Yan Zhuang
%F pmlr-v301-lefkes26a
%I PMLR
%P 914--930
%U https://proceedings.mlr.press/v301/lefkes26a.html
%V 301
%X Despite the promising capabilities of vision-language models (VLMs) in diverse tasks, recent studies reveal that they struggle with the fundamental task of image classification. In this study, we explore leveraging state-of-the-art task-specific classification models as a foundation for VLMs, aiming to preserve strong classification performance. Specifically, we assess the impact of contrastive tuning to enable cross-modal retrieval capabilities on a Vision Transformer (ViT) model trained for multi-label classification on natural images and a Hierarchical Vision Transformer (H-ViT) trained for prostate cancer grading in Whole-Slide Images (WSIs). Our results demonstrate that contrastive fine-tuning creates a clear trade-off: classification accuracy rapidly deteriorates toward zero as vision-text alignment improves. By balancing task-specific and contrastive objectives in the loss function during fine-tuning, we achieve competitive slide-level retrieval performance while maintaining classification accuracy. Our code is available on https://github.com/DIAGNijmegen/tradeoff_classification_alignment.git.

APA

Lefkes, J., Grisi, C., & Litjens, G. (2026). A Balancing Act: Optimizing Classification and Retrieval in Cross-Modal Vision Models. Proceedings of The 8th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 301:914-930. Available from https://proceedings.mlr.press/v301/lefkes26a.html.
