The Double-Ellipsoid Geometry of CLIP

Meir Yossef Levi, Guy Gilboa
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:33999-34019, 2025.

Abstract

Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications across a large variety of domains. We investigate the geometry of this embedding, which is still not well understood, and show that text and image embeddings reside on linearly separable ellipsoid shells not centered at the origin. We explain the benefits of this structure, which allows instances to be embedded according to their uncertainty during contrastive training: frequent concepts in the dataset yield more false negatives, inducing greater uncertainty. A new notion of conformity is introduced, which measures the average cosine similarity of an instance to every other instance within a representative dataset. We show this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we find that CLIP’s modality gap optimizes the matching of the conformity distributions of image and text.
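The conformity measure described above is simple to compute. The following is a minimal sketch (not the authors' released code) of both the exact definition and the mean-vector estimate; random vectors stand in here for CLIP embeddings of a single modality, and all function names are illustrative.

import numpy as np

def conformity(embeddings: np.ndarray) -> np.ndarray:
    """Average cosine similarity of each instance to every other instance."""
    # L2-normalize so that dot products equal cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = z @ z.T                      # pairwise cosine similarities
    n = z.shape[0]
    # Exclude self-similarity (the diagonal of ones) from the average.
    return (sims.sum(axis=1) - 1.0) / (n - 1)

def conformity_via_mean(embeddings: np.ndarray) -> np.ndarray:
    """Estimate conformity as cosine similarity to the modality mean vector."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    mean = z.mean(axis=0)
    mean /= np.linalg.norm(mean)
    return z @ mean

# Toy usage: random vectors in place of CLIP image or text embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 512))
exact = conformity(x)
approx = conformity_via_mean(x)
# ~1.0: over a fixed set, exact conformity is an affine function of the
# similarity to the (unnormalized) mean, so the two measures agree closely.
print(np.corrcoef(exact, approx)[0, 1])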

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-levi25b,
  title     = {The Double-Ellipsoid Geometry of {CLIP}},
  author    = {Levi, Meir Yossef and Gilboa, Guy},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {33999--34019},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/levi25b/levi25b.pdf},
  url       = {https://proceedings.mlr.press/v267/levi25b.html},
  abstract  = {Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications within a large variety of domains. We investigate the geometry of this embedding, which is still not well understood, and show that text and image reside on linearly separable ellipsoid shells, not centered at the origin. We explain the benefits of having this structure, allowing to better embed instances according to their uncertainty during contrastive training. Frequent concepts in the dataset yield more false negatives, inducing greater uncertainty. A new notion of conformity is introduced, which measures the average cosine similarity of an instance to any other instance within a representative data set. We show this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we find that CLIP’s modality gap optimizes the matching of the conformity distributions of image and text.}
}
Endnote
%0 Conference Paper
%T The Double-Ellipsoid Geometry of CLIP
%A Meir Yossef Levi
%A Guy Gilboa
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-levi25b
%I PMLR
%P 33999--34019
%U https://proceedings.mlr.press/v267/levi25b.html
%V 267
%X Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications within a large variety of domains. We investigate the geometry of this embedding, which is still not well understood, and show that text and image reside on linearly separable ellipsoid shells, not centered at the origin. We explain the benefits of having this structure, allowing to better embed instances according to their uncertainty during contrastive training. Frequent concepts in the dataset yield more false negatives, inducing greater uncertainty. A new notion of conformity is introduced, which measures the average cosine similarity of an instance to any other instance within a representative data set. We show this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we find that CLIP’s modality gap optimizes the matching of the conformity distributions of image and text.
APA
Levi, M.Y. & Gilboa, G. (2025). The Double-Ellipsoid Geometry of CLIP. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:33999-34019. Available from https://proceedings.mlr.press/v267/levi25b.html.
