Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

Yuxuan Wang; Daisy Stanton; Yu Zhang; RJ Skerry-Ryan; Eric Battenberg; Joel Shor; Ying Xiao; Ye Jia; Fei Ren; Rif A. Saurous

Proceedings of the 35th International Conference on Machine Learning, PMLR 80:5180-5189, 2018.

Abstract

In this work, we propose “global style tokens” (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable “labels” they generate can be used to control synthesis in novel ways, such as varying speed and speaking style – independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

Cite this Paper

BibTeX

@InProceedings{pmlr-v80-wang18h,
  title = 	 {Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis},
  author =       {Wang, Yuxuan and Stanton, Daisy and Zhang, Yu and Skerry-Ryan, RJ and Battenberg, Eric and Shor, Joel and Xiao, Ying and Jia, Ye and Ren, Fei and Saurous, Rif A.},
  booktitle = 	 {Proceedings of the 35th International Conference on Machine Learning},
  pages = 	 {5180--5189},
  year = 	 {2018},
  editor = 	 {Dy, Jennifer and Krause, Andreas},
  volume = 	 {80},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {10--15 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v80/wang18h/wang18h.pdf},
  url = 	 {https://proceedings.mlr.press/v80/wang18h.html},
  abstract = 	 {In this work, we propose “global style tokens” (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable “labels” they generate can be used to control synthesis in novel ways, such as varying speed and speaking style – independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.}
}

Endnote

%0 Conference Paper
%T Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
%A Yuxuan Wang
%A Daisy Stanton
%A Yu Zhang
%A RJ Skerry-Ryan
%A Eric Battenberg
%A Joel Shor
%A Ying Xiao
%A Ye Jia
%A Fei Ren
%A Rif A. Saurous
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-wang18h
%I PMLR
%P 5180--5189
%U https://proceedings.mlr.press/v80/wang18h.html
%V 80
%X In this work, we propose “global style tokens” (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable “labels” they generate can be used to control synthesis in novel ways, such as varying speed and speaking style – independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

APA

Wang, Y., Stanton, D., Zhang, Y., Skerry-Ryan, R., Battenberg, E., Shor, J., Xiao, Y., Jia, Y., Ren, F. & Saurous, R.A. (2018). Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:5180-5189. Available from https://proceedings.mlr.press/v80/wang18h.html.
