Hyperbolic Image-text Representations

Karan Desai; Maximilian Nickel; Tanmay Rajpurohit; Justin Johnson; Shanmukha Ramakrishna Vedantam

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:7694-7731, 2023.

Abstract

Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs. Despite being intuitive, current large-scale vision and language models such as CLIP do not explicitly capture such hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP’s performance on standard multi-modal tasks like image classification and image-text retrieval.

Cite this Paper

BibTeX

@InProceedings{pmlr-v202-desai23a,
  title = 	 {Hyperbolic Image-text Representations},
  author =       {Desai, Karan and Nickel, Maximilian and Rajpurohit, Tanmay and Johnson, Justin and Vedantam, Shanmukha Ramakrishna},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {7694--7731},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/desai23a/desai23a.pdf},
  url = 	 {https://proceedings.mlr.press/v202/desai23a.html},
  abstract = 	 {Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs. Despite being intuitive, current large-scale vision and language models such as CLIP do not explicitly capture such hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP’s performance on standard multi-modal tasks like image classification and image-text retrieval.}
}

Endnote

%0 Conference Paper
%T Hyperbolic Image-text Representations
%A Karan Desai
%A Maximilian Nickel
%A Tanmay Rajpurohit
%A Justin Johnson
%A Shanmukha Ramakrishna Vedantam
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-desai23a
%I PMLR
%P 7694--7731
%U https://proceedings.mlr.press/v202/desai23a.html
%V 202
%X Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs. Despite being intuitive, current large-scale vision and language models such as CLIP do not explicitly capture such hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP’s performance on standard multi-modal tasks like image classification and image-text retrieval.

APA

Desai, K., Nickel, M., Rajpurohit, T., Johnson, J. & Vedantam, S.R. (2023). Hyperbolic Image-text Representations. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:7694-7731. Available from https://proceedings.mlr.press/v202/desai23a.html.
