Hubness Reduction Improves Sentence-BERT Semantic Spaces

Beatrix Miranda Ginn Nielsen; Lars Kai Hansen

Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL), PMLR 233:181-204, 2024.

Abstract

Semantic representations of text, i.e. representations of natural language which capture meaning by geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighbourhood relations, such that some texts (the hubs) are neighbours of many other texts while most texts (so-called anti-hubs) are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and the error rate of a neighbourhood-based classifier. We find that when hubness is high, we can reduce error rate and hubness using hubness reduction methods. We identify a combination of two methods as resulting in the best reduction. For example, on one of the tested pretrained models, this combined method can reduce hubness by about 75% and error rate by about 9%. Thus, we argue that mitigating hubness in the embedding space provides better semantic representations of text.
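
To make the notion of hubs and anti-hubs concrete, the following is a minimal Python sketch of one common way to measure hubness: count how often each text appears among the k nearest neighbours of the other texts (its k-occurrence), then take the skewness of those counts. The abstract does not say which hubness scores the paper uses, so the choice of k-skewness, the cosine distance, and the function names `k_occurrence` and `k_skewness` are all assumptions made for illustration. The embeddings here are random vectors, not Sentence-BERT output.

```python
import numpy as np


def k_occurrence(X, k=10):
    """Count how often each row of X is among the k nearest neighbours of other rows.

    Assumption: cosine distance is used; the paper may use a different distance.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalise each row
    dist = 1.0 - Xn @ Xn.T                              # pairwise cosine distances
    np.fill_diagonal(dist, np.inf)                      # a point is not its own neighbour
    # Indices of the k smallest distances in each row (order within the k is arbitrary).
    nn = np.argpartition(dist, k, axis=1)[:, :k]
    return np.bincount(nn.ravel(), minlength=X.shape[0])


def k_skewness(counts):
    """Skewness (third standardised moment) of the k-occurrence distribution."""
    c = counts.astype(float)
    centered = c - c.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 300))  # stand-in for sentence embeddings
    counts = k_occurrence(X, k=10)
    print("k-skewness:", k_skewness(counts))
    print("anti-hubs (never a neighbour):", int((counts == 0).sum()))
    print("largest hub is a neighbour of", int(counts.max()), "points")
```

Under this measure, a strongly positive skewness means that a few points collect most of the neighbour slots (hubs) while many points collect few or none (anti-hubs), which is the asymmetry the abstract describes.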

Cite this Paper

BibTeX


@InProceedings{pmlr-v233-nielsen24a,
  title = {Hubness Reduction Improves Sentence-{BERT} Semantic Spaces},
  author = {Nielsen, Beatrix Miranda Ginn and Hansen, Lars Kai},
  booktitle = {Proceedings of the 5th Northern Lights Deep Learning Conference ({NLDL})},
  pages = {181--204},
  year = {2024},
  editor = {Lutchyn, Tetiana and Ramírez Rivera, Adín and Ricaud, Benjamin},
  volume = {233},
  series = {Proceedings of Machine Learning Research},
  month = {09--11 Jan},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v233/nielsen24a/nielsen24a.pdf},
  url = {https://proceedings.mlr.press/v233/nielsen24a.html},
  abstract = {Semantic representations of text, i.e. representations of natural language which capture meaning by geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighbourhood relations, such that some texts (the hubs) are neighbours of many other texts while most texts (so-called anti-hubs) are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and the error rate of a neighbourhood-based classifier. We find that when hubness is high, we can reduce error rate and hubness using hubness reduction methods. We identify a combination of two methods as resulting in the best reduction. For example, on one of the tested pretrained models, this combined method can reduce hubness by about 75% and error rate by about 9%. Thus, we argue that mitigating hubness in the embedding space provides better semantic representations of text.}
}

Endnote

%0 Conference Paper
%T Hubness Reduction Improves Sentence-BERT Semantic Spaces
%A Beatrix Miranda Ginn Nielsen
%A Lars Kai Hansen
%B Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL)
%C Proceedings of Machine Learning Research
%D 2024
%E Tetiana Lutchyn
%E Adín Ramírez Rivera
%E Benjamin Ricaud
%F pmlr-v233-nielsen24a
%I PMLR
%P 181-204
%U https://proceedings.mlr.press/v233/nielsen24a.html
%V 233
%X Semantic representations of text, i.e. representations of natural language which capture meaning by geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighbourhood relations, such that some texts (the hubs) are neighbours of many other texts while most texts (so-called anti-hubs) are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and the error rate of a neighbourhood-based classifier. We find that when hubness is high, we can reduce error rate and hubness using hubness reduction methods. We identify a combination of two methods as resulting in the best reduction. For example, on one of the tested pretrained models, this combined method can reduce hubness by about 75% and error rate by about 9%. Thus, we argue that mitigating hubness in the embedding space provides better semantic representations of text.

APA


Nielsen, B.M.G. & Hansen, L.K. (2024). Hubness Reduction Improves Sentence-BERT Semantic Spaces. Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL), in Proceedings of Machine Learning Research 233:181-204. Available from https://proceedings.mlr.press/v233/nielsen24a.html.
