Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

Dongchan Min; Dong Bok Lee; Eunho Yang; Sung Ju Hwang

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:7748-7759, 2021.

Abstract

With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. To be practical, a TTS model should generate high-quality speech from only a few short audio samples of the given speaker. However, existing methods either require fine-tuning the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN), which aligns the gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio. Furthermore, to enhance StyleSpeech’s adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes and by performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker’s voice from a single short-duration (1-3 sec) speech audio, significantly outperforming baselines.
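
The SALN idea described in the abstract can be illustrated with a minimal sketch. It assumes that the gain and bias are each produced by a single affine projection of the style vector and applied to one layer-normalized hidden vector. The function names (`layer_norm`, `affine`, `saln`), the dimensions, and all parameter values are hypothetical and are not taken from the authors' implementation.

```python
import math


def layer_norm(h, eps=1e-5):
    """Normalize one hidden vector to zero mean and (near) unit variance."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]


def affine(weights, offset, v):
    """Compute weights @ v + offset for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, offset)]


def saln(h, style, gain_params, bias_params):
    """Style-adaptive layer norm (sketch): gain and bias depend on the style.

    Assumption: gain_params and bias_params are (weights, offset) pairs that
    project the style vector to the hidden dimension.
    """
    normed = layer_norm(h)
    gain = affine(gain_params[0], gain_params[1], style)
    bias = affine(bias_params[0], bias_params[1], style)
    return [g * y + b for g, y, b in zip(gain, normed, bias)]


# Toy values: a 4-dimensional hidden vector and a 2-dimensional style vector.
hidden = [0.5, -1.0, 2.0, 0.0]
style = [1.0, -0.5]
gain_params = ([[0.1, 0.0], [0.0, 0.2], [0.3, 0.1], [0.0, 0.0]],
               [1.0, 1.0, 1.0, 1.0])
bias_params = ([[0.0, 0.1], [0.2, 0.0], [0.0, 0.0], [0.1, 0.1]],
               [0.0, 0.0, 0.0, 0.0])
styled = saln(hidden, style, gain_params, bias_params)
```

In this sketch, changing `style` changes the scale and shift applied to the same normalized text features, which is the mechanism the abstract attributes to SALN. With zero projection weights, unit gain offset, and zero bias offset, `saln` reduces to plain layer normalization.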

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-min21b,
  title = 	 {Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation},
  author =       {Min, Dongchan and Lee, Dong Bok and Yang, Eunho and Hwang, Sung Ju},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {7748--7759},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/min21b/min21b.pdf},
  url = 	 {https://proceedings.mlr.press/v139/min21b.html},
  abstract = 	 {With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech with only a few audio samples from the given speaker, that are also short in length. However, existing methods either require to fine-tune the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN) which aligns gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio. Furthermore, to enhance StyleSpeech’s adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker’s voice with single short-duration (1-3 sec) speech audio, significantly outperforming baselines.}
}

Endnote

%0 Conference Paper
%T Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation
%A Dongchan Min
%A Dong Bok Lee
%A Eunho Yang
%A Sung Ju Hwang
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-min21b
%I PMLR
%P 7748--7759
%U https://proceedings.mlr.press/v139/min21b.html
%V 139
%X With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech with only a few audio samples from the given speaker, that are also short in length. However, existing methods either require to fine-tune the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN) which aligns gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio. Furthermore, to enhance StyleSpeech’s adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker’s voice with single short-duration (1-3 sec) speech audio, significantly outperforming baselines.

APA

Min, D., Lee, D.B., Yang, E. & Hwang, S.J. (2021). Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:7748-7759. Available from https://proceedings.mlr.press/v139/min21b.html.
