Over the Top-1: Uncertainty-Aware Cross-Modal Retrieval with CLIP

Lluis Gomez
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:1521-1532, 2025.

Abstract

State-of-the-art vision-language models, such as CLIP, achieve remarkable performance in cross-modal retrieval tasks, yet estimating their predictive uncertainty remains an open challenge. While recent works have explored probabilistic embeddings to quantify retrieval uncertainty, these approaches often require model retraining or fine-tuning adapters, making them computationally expensive and dataset-dependent. In this work, we propose a training-free framework for uncertainty estimation in cross-modal retrieval. We start with a simple yet effective baseline that uses the cosine similarity between a query and its top-ranked match as an uncertainty measure. Building on this, we introduce a method that estimates uncertainty by analyzing the consistency of the top-1 retrieved item across samples drawn from the posterior predictive distribution using Monte Carlo Dropout (MCD) or Deep Ensembles. Finally, we propose an adversarial perturbation-based method, where the minimal perturbation required to alter the top-1 retrieval serves as a robust indicator of uncertainty. Empirical results on two standard cross-modal retrieval benchmarks demonstrate that our approach achieves superior calibration compared to learned probabilistic methods, all while incurring zero additional training cost.
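The three scores described in the abstract admit a compact, training-free implementation. The sketch below illustrates in PyTorch how each measure could be computed over precomputed, L2-normalized CLIP embeddings; the function names, the linear-search attack in the third score, and all parameters are illustrative assumptions, not the paper's actual code.

```python
# Illustrative sketch (not the paper's code) of three training-free
# uncertainty scores for cross-modal retrieval with CLIP embeddings.
# Assumes `query` (d,) and `gallery` (n, d) are L2-normalized tensors.
import torch
import torch.nn.functional as F


def top1_similarity_score(query, gallery):
    """Baseline: cosine similarity between the query and its top-ranked
    match. Higher similarity suggests lower uncertainty."""
    sims = gallery @ query                            # (n,) cosine similarities
    return sims.max().item()


def top1_consistency_score(query_samples, gallery):
    """MCD / Deep Ensembles score: fraction of stochastic query embeddings
    (T, d) whose top-1 retrieval agrees with the modal top-1 item."""
    top1 = (query_samples @ gallery.T).argmax(dim=1)  # (T,) top-1 indices
    counts = torch.bincount(top1, minlength=gallery.size(0))
    return (counts.max().float() / top1.numel()).item()


def min_flip_perturbation(query, gallery, step=1e-3, max_eps=1.0):
    """Adversarial score: smallest perturbation of the query embedding,
    pushed toward the runner-up item, that flips the top-1 retrieval.
    A simple linear search over the perturbation magnitude; the paper's
    attack may differ."""
    sims = gallery @ query
    top1, top2 = sims.topk(2).indices
    direction = F.normalize(gallery[top2] - gallery[top1], dim=0)
    eps = step
    while eps <= max_eps:
        q = F.normalize(query + eps * direction, dim=0)
        if (gallery @ q).argmax() != top1:
            return eps                                # minimal flip found
        eps += step
    return max_eps                                    # top-1 very stable
```

For the consistency score, `query_samples` would be obtained by encoding the query T times with dropout kept active at inference (MCD), or once per member of a Deep Ensemble. Note the polarities: higher similarity, higher agreement, and a larger perturbation needed to flip the top-1 all indicate lower uncertainty.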

Cite this Paper

BibTeX
@InProceedings{pmlr-v286-gomez25a,
  title     = {Over the Top-1: Uncertainty-Aware Cross-Modal Retrieval with CLIP},
  author    = {Gomez, Lluis},
  booktitle = {Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence},
  pages     = {1521--1532},
  year      = {2025},
  editor    = {Chiappa, Silvia and Magliacane, Sara},
  volume    = {286},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--25 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v286/main/assets/gomez25a/gomez25a.pdf},
  url       = {https://proceedings.mlr.press/v286/gomez25a.html}
}
Endnote
%0 Conference Paper
%T Over the Top-1: Uncertainty-Aware Cross-Modal Retrieval with CLIP
%A Lluis Gomez
%B Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2025
%E Silvia Chiappa
%E Sara Magliacane
%F pmlr-v286-gomez25a
%I PMLR
%P 1521--1532
%U https://proceedings.mlr.press/v286/gomez25a.html
%V 286
APA
Gomez, L. (2025). Over the Top-1: Uncertainty-Aware Cross-Modal Retrieval with CLIP. Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 286:1521-1532. Available from https://proceedings.mlr.press/v286/gomez25a.html.
