Non-Autoregressive Neural Text-to-Speech

Kainan Peng, Wei Ping, Zhao Song, Kexin Zhao
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:7586-7598, 2020.

Abstract

In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and brings a 46.7 times speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining reasonably good speech quality. ParaNet also produces stable alignment between text and speech on challenging test sentences by iteratively improving the attention in a layer-by-layer manner. Furthermore, we build a parallel text-to-speech system by applying various parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We also explore a novel VAE-based approach to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch, which avoids the need for distillation from a separately trained WaveNet as in previous work.

Cite this Paper

BibTeX
@InProceedings{pmlr-v119-peng20a,
  title     = {Non-Autoregressive Neural Text-to-Speech},
  author    = {Peng, Kainan and Ping, Wei and Song, Zhao and Zhao, Kexin},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {7586--7598},
  year      = {2020},
  editor    = {III, Hal Daumé and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/peng20a/peng20a.pdf},
  url       = {https://proceedings.mlr.press/v119/peng20a.html},
  abstract  = {In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and brings 46.7 times speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining reasonably good speech quality. ParaNet also produces stable alignment between text and speech on the challenging test sentences by iteratively improving the attention in a layer-by-layer manner. Furthermore, we build the parallel text-to-speech system by applying various parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We also explore a novel VAE-based approach to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch, which avoids the need for distillation from a separately trained WaveNet as previous work.}
}
Endnote
%0 Conference Paper
%T Non-Autoregressive Neural Text-to-Speech
%A Kainan Peng
%A Wei Ping
%A Zhao Song
%A Kexin Zhao
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-peng20a
%I PMLR
%P 7586--7598
%U https://proceedings.mlr.press/v119/peng20a.html
%V 119
%X In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and brings 46.7 times speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining reasonably good speech quality. ParaNet also produces stable alignment between text and speech on the challenging test sentences by iteratively improving the attention in a layer-by-layer manner. Furthermore, we build the parallel text-to-speech system by applying various parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We also explore a novel VAE-based approach to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch, which avoids the need for distillation from a separately trained WaveNet as previous work.
APA
Peng, K., Ping, W., Song, Z. & Zhao, K. (2020). Non-Autoregressive Neural Text-to-Speech. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:7586-7598. Available from https://proceedings.mlr.press/v119/peng20a.html.