NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Zeqian Ju; Yuancheng Wang; Kai Shen; Xu Tan; Detai Xin; Dongchao Yang; Eric Liu; Yichong Leng; Kaitao Song; Siliang Tang; Zhizheng Wu; Tao Qin; Xiangyang Li; Wei Ye; Shikun Zhang; Jiang Bian; Lei He; Jinyu Li; Sheng Zhao

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:22605-22623, 2024.

Abstract

While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering that speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by this idea, we propose a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model, which generates attributes in each subspace following its corresponding prompt. With this factorization design, our method can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experimental results show that our method outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-ju24b,
  title = 	 {{N}atural{S}peech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models},
  author =       {Ju, Zeqian and Wang, Yuancheng and Shen, Kai and Tan, Xu and Xin, Detai and Yang, Dongchao and Liu, Eric and Leng, Yichong and Song, Kaitao and Tang, Siliang and Wu, Zhizheng and Qin, Tao and Li, Xiangyang and Ye, Wei and Zhang, Shikun and Bian, Jiang and He, Lei and Li, Jinyu and Zhao, Sheng},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {22605--22623},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/ju24b/ju24b.pdf},
  url = 	 {https://proceedings.mlr.press/v235/ju24b.html},
  abstract = 	 {While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering that speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by this idea, we propose a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model, which generates attributes in each subspace following its corresponding prompt. With this factorization design, our method can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experimental results show that our method outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.}
}

Endnote

%0 Conference Paper
%T NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
%A Zeqian Ju
%A Yuancheng Wang
%A Kai Shen
%A Xu Tan
%A Detai Xin
%A Dongchao Yang
%A Eric Liu
%A Yichong Leng
%A Kaitao Song
%A Siliang Tang
%A Zhizheng Wu
%A Tao Qin
%A Xiangyang Li
%A Wei Ye
%A Shikun Zhang
%A Jiang Bian
%A Lei He
%A Jinyu Li
%A Sheng Zhao
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-ju24b
%I PMLR
%P 22605--22623
%U https://proceedings.mlr.press/v235/ju24b.html
%V 235
%X While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering that speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by this idea, we propose a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model, which generates attributes in each subspace following its corresponding prompt. With this factorization design, our method can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experimental results show that our method outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.

APA


Ju, Z., Wang, Y., Shen, K., Tan, X., Xin, D., Yang, D., Liu, E., Leng, Y., Song, K., Tang, S., Wu, Z., Qin, T., Li, X., Ye, W., Zhang, S., Bian, J., He, L., Li, J., & Zhao, S. (2024). NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:22605-22623. Available from https://proceedings.mlr.press/v235/ju24b.html.
