Speech Self-Supervised Learning Using Diffusion Model Synthetic Data

Heting Gao, Kaizhi Qian, Junrui Ni, Chuang Gan, Mark A. Hasegawa-Johnson, Shiyu Chang, Yang Zhang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:14790-14810, 2024.

Abstract

While self-supervised learning (SSL) in speech has greatly reduced the reliance of speech processing systems on annotated corpora, the success of SSL still hinges on the availability of a large-scale unannotated corpus, which is still often impractical for many low-resource languages or under privacy concerns. Some existing work seeks to alleviate the problem by data augmentation, but most works are confined to introducing perturbations to real speech and do not introduce new variations in speech prosody, speakers, and speech content, which are important for SSL. Motivated by the recent finding that diffusion models have superior capabilities for modeling data distributions, we propose DiffS4L, a pretraining scheme that augments the limited unannotated data with synthetic data with different levels of variations, generated by a diffusion model trained on the limited unannotated data. Finally, an SSL model is pre-trained on the real and the synthetic speech. Our experiments show that DiffS4L can significantly improve the performance of SSL models, such as reducing the WER of the HuBERT pretrained model by 6.26 percentage points in the English ASR task. Notably, we find that the synthetic speech with all levels of variations, i.e. new prosody, new speakers, and even new content (despite the new content being mostly babble), accounts for significant performance improvement. The code is available at github.com/Hertin/DiffS4L.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-gao24j,
  title     = {Speech Self-Supervised Learning Using Diffusion Model Synthetic Data},
  author    = {Gao, Heting and Qian, Kaizhi and Ni, Junrui and Gan, Chuang and Hasegawa-Johnson, Mark A. and Chang, Shiyu and Zhang, Yang},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {14790--14810},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/gao24j/gao24j.pdf},
  url       = {https://proceedings.mlr.press/v235/gao24j.html},
  abstract  = {While self-supervised learning (SSL) in speech has greatly reduced the reliance of speech processing systems on annotated corpora, the success of SSL still hinges on the availability of a large-scale unannotated corpus, which is still often impractical for many low-resource languages or under privacy concerns. Some existing work seeks to alleviate the problem by data augmentation, but most works are confined to introducing perturbations to real speech and do not introduce new variations in speech prosody, speakers, and speech content, which are important for SSL. Motivated by the recent finding that diffusion models have superior capabilities for modeling data distributions, we propose DiffS4L, a pretraining scheme that augments the limited unannotated data with synthetic data with different levels of variations, generated by a diffusion model trained on the limited unannotated data. Finally, an SSL model is pre-trained on the real and the synthetic speech. Our experiments show that DiffS4L can significantly improve the performance of SSL models, such as reducing the WER of the HuBERT pretrained model by 6.26 percentage points in the English ASR task. Notably, we find that the synthetic speech with all levels of variations, i.e. new prosody, new speakers, and even new content (despite the new content being mostly babble), accounts for significant performance improvement. The code is available at github.com/Hertin/DiffS4L.}
}
Endnote
%0 Conference Paper
%T Speech Self-Supervised Learning Using Diffusion Model Synthetic Data
%A Heting Gao
%A Kaizhi Qian
%A Junrui Ni
%A Chuang Gan
%A Mark A. Hasegawa-Johnson
%A Shiyu Chang
%A Yang Zhang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-gao24j
%I PMLR
%P 14790--14810
%U https://proceedings.mlr.press/v235/gao24j.html
%V 235
%X While self-supervised learning (SSL) in speech has greatly reduced the reliance of speech processing systems on annotated corpora, the success of SSL still hinges on the availability of a large-scale unannotated corpus, which is still often impractical for many low-resource languages or under privacy concerns. Some existing work seeks to alleviate the problem by data augmentation, but most works are confined to introducing perturbations to real speech and do not introduce new variations in speech prosody, speakers, and speech content, which are important for SSL. Motivated by the recent finding that diffusion models have superior capabilities for modeling data distributions, we propose DiffS4L, a pretraining scheme that augments the limited unannotated data with synthetic data with different levels of variations, generated by a diffusion model trained on the limited unannotated data. Finally, an SSL model is pre-trained on the real and the synthetic speech. Our experiments show that DiffS4L can significantly improve the performance of SSL models, such as reducing the WER of the HuBERT pretrained model by 6.26 percentage points in the English ASR task. Notably, we find that the synthetic speech with all levels of variations, i.e. new prosody, new speakers, and even new content (despite the new content being mostly babble), accounts for significant performance improvement. The code is available at github.com/Hertin/DiffS4L.
APA
Gao, H., Qian, K., Ni, J., Gan, C., Hasegawa-Johnson, M. A., Chang, S., & Zhang, Y. (2024). Speech Self-Supervised Learning Using Diffusion Model Synthetic Data. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:14790-14810. Available from https://proceedings.mlr.press/v235/gao24j.html.