Whitened CLIP as a Likelihood Surrogate of Images and Captions

Roy Betser, Meir Yossef Levi, Guy Gilboa
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:4069-4095, 2025.

Abstract

Likelihood approximations for images are not trivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce Whitened CLIP, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with all other features, resulting in an identity covariance matrix. We show that the whitened embedding statistics can be well approximated by a standard normal distribution, allowing log-likelihood to be estimated using the squared Euclidean norm in the whitened space. The whitening procedure is completely training-free and uses a precomputed whitening matrix, making it extremely fast. We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores to images and captions. Our code is available at github.com/rbetser/W_CLIP/tree/main.
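The core idea described in the abstract (whitening embeddings with a precomputed invertible linear map, then scoring log-likelihood via the squared Euclidean norm under a standard-normal approximation) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; the function names and the use of ZCA-style whitening via eigendecomposition are assumptions for demonstration.

```python
import numpy as np

def fit_whitening(embeddings, eps=1e-8):
    """Estimate the mean and a ZCA-style whitening matrix from embeddings.

    Illustrative sketch of a precomputed, invertible linear whitening
    (zero mean, unit variance, identity covariance in the output space).
    """
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings - mu, rowvar=False)
    # Covariance is symmetric, so use eigh for the eigendecomposition.
    evals, evecs = np.linalg.eigh(cov)
    # ZCA whitening matrix: Sigma^{-1/2} (eps guards small eigenvalues).
    W = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return mu, W

def log_likelihood_surrogate(x, mu, W):
    """Surrogate log-likelihood of one embedding.

    Under the standard-normal approximation in the whitened space,
    log p(x) is, up to an additive constant, the negative half squared
    Euclidean norm of the whitened embedding.
    """
    z = W @ (x - mu)
    return -0.5 * np.dot(z, z)
```

Once `mu` and `W` are precomputed from a reference set of embeddings, scoring a new embedding is a single matrix-vector product plus a norm, which is why the procedure is training-free and fast.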

Cite this Paper
BibTeX
@InProceedings{pmlr-v267-betser25a,
  title     = {Whitened {CLIP} as a Likelihood Surrogate of Images and Captions},
  author    = {Betser, Roy and Levi, Meir Yossef and Gilboa, Guy},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {4069--4095},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/betser25a/betser25a.pdf},
  url       = {https://proceedings.mlr.press/v267/betser25a.html},
  abstract  = {Likelihood approximations for images are not trivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce Whitened CLIP, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with all other features, resulting in an identity covariance matrix. We show that the whitened embedding statistics can be well approximated by a standard normal distribution, allowing log-likelihood to be estimated using the squared Euclidean norm in the whitened space. The whitening procedure is completely training-free and uses a precomputed whitening matrix, making it extremely fast. We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores to images and captions. Our code is available at github.com/rbetser/W_CLIP/tree/main.}
}
Endnote
%0 Conference Paper
%T Whitened CLIP as a Likelihood Surrogate of Images and Captions
%A Roy Betser
%A Meir Yossef Levi
%A Guy Gilboa
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-betser25a
%I PMLR
%P 4069--4095
%U https://proceedings.mlr.press/v267/betser25a.html
%V 267
%X Likelihood approximations for images are not trivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce Whitened CLIP, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with all other features, resulting in an identity covariance matrix. We show that the whitened embedding statistics can be well approximated by a standard normal distribution, allowing log-likelihood to be estimated using the squared Euclidean norm in the whitened space. The whitening procedure is completely training-free and uses a precomputed whitening matrix, making it extremely fast. We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores to images and captions. Our code is available at github.com/rbetser/W_CLIP/tree/main.
APA
Betser, R., Levi, M.Y. & Gilboa, G. (2025). Whitened CLIP as a Likelihood Surrogate of Images and Captions. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:4069-4095. Available from https://proceedings.mlr.press/v267/betser25a.html.