Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino
HEAR: Holistic Evaluation of Audio Representations (NeurIPS 2021 Competition), PMLR 166:1-24, 2022.

Abstract

Recent general-purpose audio representations show state-of-the-art performance on various audio tasks. These representations are pre-trained by self-supervised learning methods that create training signals from the input. For example, typical audio contrastive learning uses temporal relationships among input sounds to create training signals, whereas some methods use differences among input views created by data augmentations. However, these training signals do not provide information derived from the intact input sound, which we think is suboptimal for learning a representation that describes the input as it is. In this paper, we seek to learn audio representations from the input itself as supervision, using a pretext task of auto-encoding masked spectrogram patches: Masked Spectrogram Modeling (MSM, a variant of Masked Image Modeling applied to the audio spectrogram). To implement MSM, we use Masked Autoencoders (MAE), an image self-supervised learning method. MAE learns to efficiently encode a small number of visible patches into latent representations that carry essential information for reconstructing the large number of masked patches. During training, MAE minimizes the reconstruction error, which uses the input as the training signal, consequently achieving our goal. We conducted experiments on our MSM using MAE (MSM-MAE) models under the evaluation benchmark of the HEAR 2021 NeurIPS Challenge. Our MSM-MAE models outperformed the HEAR 2021 Challenge results on seven out of 15 tasks (e.g., accuracies of 73.4% on CREMA-D and 85.8% on LibriCount), while showing top performance on other tasks where specialized models perform better. We also investigate how the design choices of MSM-MAE impact performance and conduct a qualitative analysis of visualization outcomes to gain an understanding of the learned representations. We have made our code available online for further improvements and applications of the MSM framework.
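To make the pretext task concrete, the sketch below shows an MAE-style masked spectrogram modeling step in PyTorch: the spectrogram is split into patches, most patches are masked, only the visible patches are encoded, and a light decoder reconstructs the masked patches under an MSE loss computed on the masked positions only. This is a minimal illustration, not the authors' released implementation; the spectrogram shape, patch size, mask ratio, Transformer sizes, and the class name MSMMAESketch are assumptions chosen for readability.

```python
# Minimal, illustrative sketch of the MSM pretext task (MAE-style), not the paper's code.
import torch
import torch.nn as nn


class MSMMAESketch(nn.Module):
    """Mask spectrogram patches, encode visible ones, reconstruct the masked ones."""

    def __init__(self, n_mels=80, n_frames=208, patch=16,
                 dim=384, dec_dim=192, mask_ratio=0.75):
        super().__init__()
        self.patch = patch
        self.mask_ratio = mask_ratio
        self.n_patches = (n_mels // patch) * (n_frames // patch)
        patch_dim = patch * patch

        # Encoder: linear patch embedding + Transformer over visible patches only.
        self.to_patch_embed = nn.Linear(patch_dim, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.n_patches, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True), num_layers=6)

        # Lightweight decoder: mask tokens fill in the masked positions.
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos_embed = nn.Parameter(torch.zeros(1, self.n_patches, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=6, batch_first=True), num_layers=2)
        self.to_pixels = nn.Linear(dec_dim, patch_dim)

    def patchify(self, spec):
        # spec: (B, 1, n_mels, n_frames) -> (B, n_patches, patch*patch)
        b, p = spec.shape[0], self.patch
        x = spec.unfold(2, p, p).unfold(3, p, p)      # (B, 1, F', T', p, p)
        return x.contiguous().view(b, -1, p * p)

    def forward(self, spec):
        patches = self.patchify(spec)                  # (B, N, p*p)
        b, n, _ = patches.shape
        tokens = self.to_patch_embed(patches) + self.pos_embed

        # Random masking: keep only a small fraction of patches visible.
        n_keep = int(n * (1.0 - self.mask_ratio))
        shuffle = torch.rand(b, n, device=spec.device).argsort(dim=1)
        keep_idx = shuffle[:, :n_keep]
        visible = torch.gather(
            tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))

        latent = self.encoder(visible)                 # encode visible patches only

        # Decoder input: encoded visible tokens + mask tokens, restored to original order.
        dec_tokens = self.enc_to_dec(latent)
        mask_tokens = self.mask_token.expand(b, n - n_keep, -1)
        full = torch.cat([dec_tokens, mask_tokens], dim=1)
        restore = shuffle.argsort(dim=1)
        full = torch.gather(
            full, 1, restore.unsqueeze(-1).expand(-1, -1, full.shape[-1]))
        recon = self.to_pixels(self.decoder(full + self.dec_pos_embed))

        # Reconstruction (MSE) loss on masked patches only; the input is the training signal.
        mask = torch.ones(b, n, device=spec.device)
        mask.scatter_(1, keep_idx, 0.0)                # 1 = masked, 0 = visible
        loss = ((recon - patches) ** 2).mean(dim=-1)
        return (loss * mask).sum() / mask.sum()


# Usage: one pre-training step on a batch of (assumed) 80-mel log-mel spectrograms.
model = MSMMAESketch()
spec = torch.randn(4, 1, 80, 208)                      # (batch, channel, mels, frames)
loss = model(spec)
loss.backward()
```

After pre-training, only the encoder would be kept and its latent representations used as the general-purpose audio features evaluated on downstream tasks.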

Cite this Paper


BibTeX
@InProceedings{pmlr-v166-niizumi22a,
  title     = {Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation},
  author    = {Niizumi, Daisuke and Takeuchi, Daiki and Ohishi, Yasunori and Harada, Noboru and Kashino, Kunio},
  booktitle = {HEAR: Holistic Evaluation of Audio Representations (NeurIPS 2021 Competition)},
  pages     = {1--24},
  year      = {2022},
  editor    = {Turian, Joseph and Schuller, Björn W. and Herremans, Dorien and Kirchoff, Katrin and Perera, Paola Garcia and Esling, Philippe},
  volume    = {166},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--14 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v166/niizumi22a/niizumi22a.pdf},
  url       = {https://proceedings.mlr.press/v166/niizumi22a.html}
}
EndNote
%0 Conference Paper
%T Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation
%A Daisuke Niizumi
%A Daiki Takeuchi
%A Yasunori Ohishi
%A Noboru Harada
%A Kunio Kashino
%B HEAR: Holistic Evaluation of Audio Representations (NeurIPS 2021 Competition)
%C Proceedings of Machine Learning Research
%D 2022
%E Joseph Turian
%E Björn W. Schuller
%E Dorien Herremans
%E Katrin Kirchoff
%E Paola Garcia Perera
%E Philippe Esling
%F pmlr-v166-niizumi22a
%I PMLR
%P 1--24
%U https://proceedings.mlr.press/v166/niizumi22a.html
%V 166
APA
Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N. & Kashino, K. (2022). Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation. HEAR: Holistic Evaluation of Audio Representations (NeurIPS 2021 Competition), in Proceedings of Machine Learning Research 166:1-24. Available from https://proceedings.mlr.press/v166/niizumi22a.html.
