Hubness Reduction Improves Sentence-BERT Semantic Spaces

Beatrix Miranda Ginn Nielsen, Lars Kai Hansen
Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL), PMLR 233:181-204, 2024.

Abstract

Semantic representations of text, i.e., representations of natural language which capture meaning by geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighbourhood relations, such that some texts (the hubs) are neighbours of many other texts while most texts (so-called anti-hubs) are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and the error rate of a neighbourhood-based classifier. We find that when hubness is high, we can reduce error rate and hubness using hubness reduction methods. We identify a combination of two methods as resulting in the best reduction. For example, on one of the tested pretrained models, this combined method can reduce hubness by about 75% and error rate by about 9%. Thus, we argue that mitigating hubness in the embedding space provides better semantic representations of text.
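The sketch below illustrates, on a small scale, the kind of analysis the abstract describes: embed a corpus with Sentence-BERT, measure hubness as the skewness of the k-occurrence distribution (how often each text appears among the k nearest neighbours of other texts), and apply a hubness reduction method. This is our illustration, not the authors' code: the model name (all-MiniLM-L6-v2), the choice k=10, the Euclidean metric, and the use of empirical Mutual Proximity (one standard reduction method, not necessarily the two-method combination the paper identifies as best) are all assumptions.

    # Minimal sketch, assuming sentence-transformers, scikit-learn and scipy
    # are installed. Model, k, and the reduction method are illustrative
    # choices, not the paper's exact setup.
    import numpy as np
    from scipy.stats import skew
    from sklearn.metrics import pairwise_distances
    from sentence_transformers import SentenceTransformer

    def k_occurrence(dist, k=10):
        """N_k(x): how many points count x among their k nearest neighbours."""
        n = dist.shape[0]
        d = dist.copy()
        np.fill_diagonal(d, np.inf)          # a point is never its own neighbour
        knn = np.argsort(d, axis=1)[:, :k]   # k nearest neighbours of every point
        return np.bincount(knn.ravel(), minlength=n)

    def hubness(dist, k=10):
        """Hubness score: skewness of the k-occurrence distribution."""
        return skew(k_occurrence(dist, k))

    def mutual_proximity(dist):
        """Empirical Mutual Proximity: MP(x, y) is the fraction of other
        points farther from both x and y than x and y are from each other;
        1 - MP serves as a secondary, hubness-reduced distance."""
        n = dist.shape[0]
        mp = np.zeros_like(dist)
        for i in range(n):
            for j in range(i + 1, n):
                shared = np.sum((dist[i] > dist[i, j]) & (dist[j] > dist[i, j]))
                mp[i, j] = mp[j, i] = shared / (n - 2)
        out = 1.0 - mp
        np.fill_diagonal(out, 0.0)
        return out

    # Stand-in corpus; use real documents (a few hundred or more) in practice.
    texts = [f"example sentence number {i}" for i in range(300)]
    model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model choice
    emb = model.encode(texts)
    primary = pairwise_distances(emb, metric="euclidean")
    secondary = mutual_proximity(primary)
    print(f"hubness before: {hubness(primary):.2f}, "
          f"after MP: {hubness(secondary):.2f}")

The intuition behind Mutual Proximity is that hub points, being close to almost everything, stop looking special once distances are rescaled by how mutual each neighbour relation is, which is why the skewness of the k-occurrence distribution typically drops after the transformation.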

Cite this Paper


BibTeX
@InProceedings{pmlr-v233-nielsen24a,
  title     = {Hubness Reduction Improves Sentence-{BERT} Semantic Spaces},
  author    = {Nielsen, Beatrix Miranda Ginn and Hansen, Lars Kai},
  booktitle = {Proceedings of the 5th Northern Lights Deep Learning Conference ({NLDL})},
  pages     = {181--204},
  year      = {2024},
  editor    = {Lutchyn, Tetiana and Ramírez Rivera, Adín and Ricaud, Benjamin},
  volume    = {233},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--11 Jan},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v233/nielsen24a/nielsen24a.pdf},
  url       = {https://proceedings.mlr.press/v233/nielsen24a.html},
  abstract  = {Semantic representations of text, i.e., representations of natural language which capture meaning by geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighbourhood relations, such that some texts (the hubs) are neighbours of many other texts while most texts (so-called anti-hubs) are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and the error rate of a neighbourhood-based classifier. We find that when hubness is high, we can reduce error rate and hubness using hubness reduction methods. We identify a combination of two methods as resulting in the best reduction. For example, on one of the tested pretrained models, this combined method can reduce hubness by about 75% and error rate by about 9%. Thus, we argue that mitigating hubness in the embedding space provides better semantic representations of text.}
}
Endnote
%0 Conference Paper
%T Hubness Reduction Improves Sentence-BERT Semantic Spaces
%A Beatrix Miranda Ginn Nielsen
%A Lars Kai Hansen
%B Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL)
%C Proceedings of Machine Learning Research
%D 2024
%E Tetiana Lutchyn
%E Adín Ramírez Rivera
%E Benjamin Ricaud
%F pmlr-v233-nielsen24a
%I PMLR
%P 181--204
%U https://proceedings.mlr.press/v233/nielsen24a.html
%V 233
%X Semantic representations of text, i.e., representations of natural language which capture meaning by geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighbourhood relations, such that some texts (the hubs) are neighbours of many other texts while most texts (so-called anti-hubs) are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and the error rate of a neighbourhood-based classifier. We find that when hubness is high, we can reduce error rate and hubness using hubness reduction methods. We identify a combination of two methods as resulting in the best reduction. For example, on one of the tested pretrained models, this combined method can reduce hubness by about 75% and error rate by about 9%. Thus, we argue that mitigating hubness in the embedding space provides better semantic representations of text.
APA
Nielsen, B.M.G. & Hansen, L.K. (2024). Hubness Reduction Improves Sentence-BERT Semantic Spaces. Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL), in Proceedings of Machine Learning Research 233:181-204. Available from https://proceedings.mlr.press/v233/nielsen24a.html.

Related Material

Download PDF: https://proceedings.mlr.press/v233/nielsen24a/nielsen24a.pdf