Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron Weiss, Rob Clark, Rif A. Saurous
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:4693-4702, 2018.

Abstract

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-skerry-ryan18a,
  title     = {Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron},
  author    = {Skerry-Ryan, RJ and Battenberg, Eric and Xiao, Ying and Wang, Yuxuan and Stanton, Daisy and Shor, Joel and Weiss, Ron and Clark, Rob and Saurous, Rif A.},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning},
  pages     = {4693--4702},
  year      = {2018},
  editor    = {Dy, Jennifer and Krause, Andreas},
  volume    = {80},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--15 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v80/skerry-ryan18a/skerry-ryan18a.pdf},
  url       = {https://proceedings.mlr.press/v80/skerry-ryan18a.html},
  abstract  = {We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.}
}
Endnote
%0 Conference Paper
%T Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
%A RJ Skerry-Ryan
%A Eric Battenberg
%A Ying Xiao
%A Yuxuan Wang
%A Daisy Stanton
%A Joel Shor
%A Ron Weiss
%A Rob Clark
%A Rif A. Saurous
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-skerry-ryan18a
%I PMLR
%P 4693--4702
%U https://proceedings.mlr.press/v80/skerry-ryan18a.html
%V 80
%X We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.
APA
Skerry-Ryan, R., Battenberg, E., Xiao, Y., Wang, Y., Stanton, D., Shor, J., Weiss, R., Clark, R., & Saurous, R. A. (2018). Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:4693-4702. Available from https://proceedings.mlr.press/v80/skerry-ryan18a.html.
