EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture

Chenfeng Miao, Liang Shuang, Zhengchen Liu, Chen Minchuan, Jun Ma, Shaojun Wang, Jing Xiao
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:7700-7709, 2021.

Abstract

In this work, we address the Text-to-Speech (TTS) task by proposing a non-autoregressive architecture called EfficientTTS. Unlike the dominant non-autoregressive TTS models, which require external aligners for training, EfficientTTS optimizes all its parameters with a stable, end-to-end training procedure, allowing for synthesizing high-quality speech in a fast and efficient manner. EfficientTTS is motivated by a new monotonic alignment modeling approach, which imposes monotonic constraints on the sequence alignment with almost no increase in computation. By combining EfficientTTS with different feed-forward network structures, we develop a family of TTS models, including both text-to-melspectrogram and text-to-waveform networks. We experimentally show that the proposed models significantly outperform counterpart models such as Tacotron 2 and Glow-TTS in terms of speech quality, training efficiency, and synthesis speed, while still producing speech with strong robustness and great diversity. In addition, we demonstrate that the proposed approach can be easily extended to autoregressive models such as Tacotron 2.
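To make the monotonic-alignment idea concrete, here is a minimal, hypothetical sketch (not the paper's actual method) of one common way to quantify monotonicity of a soft alignment: reduce each output frame's attention distribution over input tokens to an expected input position, then penalize any decrease in that position across consecutive frames. The function names and the penalty form are illustrative assumptions.

```python
import numpy as np

def expected_positions(attn):
    """attn: (T_out, T_in) soft alignment; each row sums to 1.
    Returns the expected input index attended to by each output frame."""
    idx = np.arange(attn.shape[1], dtype=float)
    return attn @ idx

def monotonicity_penalty(attn):
    """Sum of backward jumps in the expected-position sequence.
    Zero iff the expected positions are non-decreasing (monotonic)."""
    pos = expected_positions(attn)
    diffs = np.diff(pos)
    return float(np.sum(np.clip(-diffs, 0.0, None)))

# A diagonal (perfectly monotonic) alignment incurs no penalty,
# while a reversed alignment is penalized.
print(monotonicity_penalty(np.eye(3)))        # 0.0
print(monotonicity_penalty(np.eye(3)[::-1]))  # 2.0
```

Because the penalty is computed from a single matrix-vector product and a first difference, it adds negligible cost to training, consistent with the abstract's claim of "almost no increase in computation".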

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-miao21a,
  title     = {EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture},
  author    = {Miao, Chenfeng and Shuang, Liang and Liu, Zhengchen and Minchuan, Chen and Ma, Jun and Wang, Shaojun and Xiao, Jing},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {7700--7709},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/miao21a/miao21a.pdf},
  url       = {https://proceedings.mlr.press/v139/miao21a.html},
  abstract  = {In this work, we address the Text-to-Speech (TTS) task by proposing a non-autoregressive architecture called EfficientTTS. Unlike the dominant non-autoregressive TTS models, which are trained with the need of external aligners, EfficientTTS optimizes all its parameters with a stable, end-to-end training procedure, allowing for synthesizing high quality speech in a fast and efficient manner. EfficientTTS is motivated by a new monotonic alignment modeling approach, which specifies monotonic constraints to the sequence alignment with almost no increase of computation. By combining EfficientTTS with different feed-forward network structures, we develop a family of TTS models, including both text-to-melspectrogram and text-to-waveform networks. We experimentally show that the proposed models significantly outperform counterpart models such as Tacotron 2 and Glow-TTS in terms of speech quality, training efficiency and synthesis speed, while still producing the speeches of strong robustness and great diversity. In addition, we demonstrate that proposed approach can be easily extended to autoregressive models such as Tacotron 2.}
}
Endnote
%0 Conference Paper
%T EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture
%A Chenfeng Miao
%A Liang Shuang
%A Zhengchen Liu
%A Chen Minchuan
%A Jun Ma
%A Shaojun Wang
%A Jing Xiao
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-miao21a
%I PMLR
%P 7700--7709
%U https://proceedings.mlr.press/v139/miao21a.html
%V 139
%X In this work, we address the Text-to-Speech (TTS) task by proposing a non-autoregressive architecture called EfficientTTS. Unlike the dominant non-autoregressive TTS models, which are trained with the need of external aligners, EfficientTTS optimizes all its parameters with a stable, end-to-end training procedure, allowing for synthesizing high quality speech in a fast and efficient manner. EfficientTTS is motivated by a new monotonic alignment modeling approach, which specifies monotonic constraints to the sequence alignment with almost no increase of computation. By combining EfficientTTS with different feed-forward network structures, we develop a family of TTS models, including both text-to-melspectrogram and text-to-waveform networks. We experimentally show that the proposed models significantly outperform counterpart models such as Tacotron 2 and Glow-TTS in terms of speech quality, training efficiency and synthesis speed, while still producing the speeches of strong robustness and great diversity. In addition, we demonstrate that proposed approach can be easily extended to autoregressive models such as Tacotron 2.
APA
Miao, C., Shuang, L., Liu, Z., Minchuan, C., Ma, J., Wang, S. & Xiao, J. (2021). EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:7700-7709. Available from https://proceedings.mlr.press/v139/miao21a.html.