DFlow: A Generative Model Combining Denoising AutoEncoder and Normalizing Flow for High Fidelity Waveform Generation

Chenfeng Miao, Qingying Zhu, Minchuan Chen, Wei Hu, Zijian Li, Shaojun Wang, Jing Xiao
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:35590-35606, 2024.

Abstract

In this work, we present DFlow, a novel generative framework that combines Normalizing Flow (NF) with a Denoising AutoEncoder (DAE), for high-fidelity waveform generation. With a tactfully designed structure, DFlow seamlessly integrates the capabilities of both NF and DAE, resulting in a significantly improved performance compared to the standard NF models. Experimental results showcase DFlow’s superiority, achieving the highest MOS score among the existing methods on commonly used datasets and the fastest synthesis speed among all likelihood models. We further demonstrate the generalization ability of DFlow by generating high-quality out-of-distribution audio samples, such as singing and music audio. Additionally, we extend the model capacity of DFlow by scaling up both the model size and training set size. Our large-scale universal vocoder, DFlow-XL, achieves highly competitive performance against the best universal vocoder, BigVGAN.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-miao24d, title = {{DF}low: A Generative Model Combining Denoising {A}uto{E}ncoder and Normalizing Flow for High Fidelity Waveform Generation}, author = {Miao, Chenfeng and Zhu, Qingying and Chen, Minchuan and Hu, Wei and Li, Zijian and Wang, Shaojun and Xiao, Jing}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {35590--35606}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/miao24d/miao24d.pdf}, url = {https://proceedings.mlr.press/v235/miao24d.html}, abstract = {In this work, we present DFlow, a novel generative framework that combines Normalizing Flow (NF) with a Denoising AutoEncoder (DAE), for high-fidelity waveform generation. With a tactfully designed structure, DFlow seamlessly integrates the capabilities of both NF and DAE, resulting in a significantly improved performance compared to the standard NF models. Experimental results showcase DFlow’s superiority, achieving the highest MOS score among the existing methods on commonly used datasets and the fastest synthesis speed among all likelihood models. We further demonstrate the generalization ability of DFlow by generating high-quality out-of-distribution audio samples, such as singing and music audio. Additionally, we extend the model capacity of DFlow by scaling up both the model size and training set size. Our large-scale universal vocoder, DFlow-XL, achieves highly competitive performance against the best universal vocoder, BigVGAN.} }
Endnote
%0 Conference Paper %T DFlow: A Generative Model Combining Denoising AutoEncoder and Normalizing Flow for High Fidelity Waveform Generation %A Chenfeng Miao %A Qingying Zhu %A Minchuan Chen %A Wei Hu %A Zijian Li %A Shaojun Wang %A Jing Xiao %B Proceedings of the 41st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2024 %E Ruslan Salakhutdinov %E Zico Kolter %E Katherine Heller %E Adrian Weller %E Nuria Oliver %E Jonathan Scarlett %E Felix Berkenkamp %F pmlr-v235-miao24d %I PMLR %P 35590--35606 %U https://proceedings.mlr.press/v235/miao24d.html %V 235 %X In this work, we present DFlow, a novel generative framework that combines Normalizing Flow (NF) with a Denoising AutoEncoder (DAE), for high-fidelity waveform generation. With a tactfully designed structure, DFlow seamlessly integrates the capabilities of both NF and DAE, resulting in a significantly improved performance compared to the standard NF models. Experimental results showcase DFlow’s superiority, achieving the highest MOS score among the existing methods on commonly used datasets and the fastest synthesis speed among all likelihood models. We further demonstrate the generalization ability of DFlow by generating high-quality out-of-distribution audio samples, such as singing and music audio. Additionally, we extend the model capacity of DFlow by scaling up both the model size and training set size. Our large-scale universal vocoder, DFlow-XL, achieves highly competitive performance against the best universal vocoder, BigVGAN.
APA
Miao, C., Zhu, Q., Chen, M., Hu, W., Li, Z., Wang, S. & Xiao, J.. (2024). DFlow: A Generative Model Combining Denoising AutoEncoder and Normalizing Flow for High Fidelity Waveform Generation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:35590-35606 Available from https://proceedings.mlr.press/v235/miao24d.html.

Related Material