Reliable Measures of Spread in High Dimensional Latent Spaces

Anna Marbut; Katy Mckinney-Bock; Travis Wheeler

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:23871-23885, 2023.

Abstract

Understanding geometric properties of the latent spaces of natural language processing models allows the manipulation of these properties for improved performance on downstream tasks. One such property is the amount of data spread in a model’s latent space, or how fully the available latent space is being used. We demonstrate that the commonly used measures of data spread, average cosine similarity and a partition function min/max ratio I(V), do not provide reliable metrics to compare the use of latent space across data distributions. We propose and examine six alternative measures of data spread, all of which improve over these current metrics when applied to seven synthetic data distributions. Of our proposed measures, we recommend one principal component-based measure and one entropy-based measure that provide reliable, relative measures of spread and can be used to compare models of different sizes and dimensionalities.
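For readers unfamiliar with the first baseline named above, average cosine similarity is usually taken to mean the mean cosine similarity over all distinct pairs of vectors. The following is a minimal sketch of that common definition, not the authors' implementation; the function names and the toy 2-D point sets are hypothetical, and it assumes no vector is all zeros.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_cosine_similarity(vectors):
    """Mean cosine similarity over all distinct pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy data: one tight cluster, one set spread around the origin.
clustered = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

print(average_cosine_similarity(clustered))
print(average_cosine_similarity(spread))
```

Under this definition, values near 1 are typically read as vectors crowded into a narrow cone, and values near or below 0 as directions that are more evenly spread. The abstract's claim is that this reading does not compare reliably across data distributions, so the sketch only illustrates what is being computed.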

Cite this Paper

BibTeX


@InProceedings{pmlr-v202-marbut23a,
  title = 	 {Reliable Measures of Spread in High Dimensional Latent Spaces},
  author =       {Marbut, Anna and Mckinney-Bock, Katy and Wheeler, Travis},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {23871--23885},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/marbut23a/marbut23a.pdf},
  url = 	 {https://proceedings.mlr.press/v202/marbut23a.html},
  abstract = 	 {Understanding geometric properties of the latent spaces of natural language processing models allows the manipulation of these properties for improved performance on downstream tasks. One such property is the amount of data spread in a model’s latent space, or how fully the available latent space is being used. We demonstrate that the commonly used measures of data spread, average cosine similarity and a partition function min/max ratio I(V), do not provide reliable metrics to compare the use of latent space across data distributions. We propose and examine six alternative measures of data spread, all of which improve over these current metrics when applied to seven synthetic data distributions. Of our proposed measures, we recommend one principal component-based measure and one entropy-based measure that provide reliable, relative measures of spread and can be used to compare models of different sizes and dimensionalities.}
}

Endnote

%0 Conference Paper
%T Reliable Measures of Spread in High Dimensional Latent Spaces
%A Anna Marbut
%A Katy Mckinney-Bock
%A Travis Wheeler
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-marbut23a
%I PMLR
%P 23871--23885
%U https://proceedings.mlr.press/v202/marbut23a.html
%V 202
%X Understanding geometric properties of the latent spaces of natural language processing models allows the manipulation of these properties for improved performance on downstream tasks. One such property is the amount of data spread in a model’s latent space, or how fully the available latent space is being used. We demonstrate that the commonly used measures of data spread, average cosine similarity and a partition function min/max ratio I(V), do not provide reliable metrics to compare the use of latent space across data distributions. We propose and examine six alternative measures of data spread, all of which improve over these current metrics when applied to seven synthetic data distributions. Of our proposed measures, we recommend one principal component-based measure and one entropy-based measure that provide reliable, relative measures of spread and can be used to compare models of different sizes and dimensionalities.

APA


Marbut, A., Mckinney-Bock, K. & Wheeler, T. (2023). Reliable Measures of Spread in High Dimensional Latent Spaces. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:23871-23885. Available from https://proceedings.mlr.press/v202/marbut23a.html.
