From HEAR to GEAR: Generative Evaluation of Audio Representations
HEAR: Holistic Evaluation of Audio Representations (NeurIPS 2021 Competition), PMLR 166:48-64, 2022.
Abstract
The “Holistic Evaluation of Audio Representations” (HEAR) is an emerging research program towards statistical models that can transfer to diverse machine listening tasks. The originality of HEAR is to conduct a fair, “apples-to-apples” comparison of many deep learning models over many datasets, resulting in multitask evaluation metrics that are readily interpretable by practitioners. On the flip side, this comparison amounts to a neural architecture search: as such, it is not directly interpretable in terms of audio signal processing. In this paper, we propose a complementary viewpoint on the HEAR benchmark, which we name GEAR: Generative Evaluation of Audio Representations. The key idea behind GEAR is to generate a dataset of sounds with few independent factors of variability, analyze it with HEAR embeddings, and visualize it with an unsupervised manifold learning algorithm. Visual inspection reveals stark contrasts in the global structure of the nearest-neighbor graphs associated with logmelspec, Open-$L^3$, BYOL, CREPE, wav2vec2, GURA, and YAMNet. Although GEAR currently lacks mathematical refinement, we intend it as a proof of concept to show the potential of parametric audio synthesis in general-purpose machine listening research.
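To make the three-stage pipeline in the abstract concrete (generate sounds with few independent factors, embed them, visualize the embedding manifold), the following is a minimal sketch, not the authors' released code. The choice of synthesizer (amplitude-modulated sine tones with two factors), the logmelspec settings, and the use of Isomap as the unsupervised manifold learner are all assumptions made for illustration.

```python
# Hypothetical sketch of a GEAR-style pipeline; parameter values are assumptions.
import numpy as np
import librosa
from sklearn.manifold import Isomap

SR = 16000        # sampling rate in Hz (assumed)
DURATION = 1.0    # clip length in seconds (assumed)

def synth(f0, fm):
    """AM sine tone: two independent factors, carrier f0 and modulation rate fm."""
    t = np.arange(int(SR * DURATION)) / SR
    return np.sin(2 * np.pi * f0 * t) * (0.5 + 0.5 * np.cos(2 * np.pi * fm * t))

def logmelspec(y):
    """Flattened log-mel spectrogram, standing in for a HEAR embedding."""
    S = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=64)
    return librosa.power_to_db(S, ref=np.max).ravel()

# Generate a grid of sounds over the two factors of variability.
f0s = np.geomspace(110.0, 880.0, num=16)  # carrier frequencies (Hz)
fms = np.linspace(2.0, 16.0, num=16)      # modulation rates (Hz)
X = np.stack([logmelspec(synth(f0, fm)) for f0 in f0s for fm in fms])

# Unsupervised manifold learning over the nearest-neighbor graph of embeddings.
Z = Isomap(n_neighbors=8, n_components=2).fit_transform(X)
print(Z.shape)  # (256, 2): one 2-D point per generated sound, ready to scatter-plot
```

Swapping logmelspec for another embedding (e.g., Open-$L^3$ or wav2vec2) while keeping the generator and manifold learner fixed is what would enable the kind of visual comparison the abstract reports.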