WaveFlow: A Compact Flow-based Model for Raw Audio

Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:7706-7716, 2020.

Abstract

In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood. It handles the long-range structure of 1-D waveform with a dilated 2-D convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases. It generates high-fidelity speech as WaveNet, while synthesizing several orders of magnitude faster as it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps. Furthermore, it can significantly reduce the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Finally, our small-footprint WaveFlow has only 5.91M parameters, which is 15× smaller than WaveGlow. It can generate 22.05 kHz high-fidelity audio 42.6× faster than real-time (at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels.
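The abstract's key idea — handling a long 1-D waveform with a 2-D convolutional architecture — rests on "squeezing" the waveform into a 2-D array so that one axis carries long-range structure and the other carries local, autoregressively modeled variation. The following is a minimal illustrative sketch of that squeeze step only (not the authors' code; the function name, the toy signal, and the choice of height `h` are assumptions for illustration):

```python
import numpy as np

def squeeze(waveform: np.ndarray, h: int) -> np.ndarray:
    """Reshape a 1-D waveform of length T into an (h, T // h) matrix.

    Adjacent time samples end up in the same column, so an autoregressive
    factorization over the height dimension models local variation, while
    dilated 2-D convolutions can capture long-range structure across columns.
    """
    T = len(waveform)
    assert T % h == 0, "waveform length must be divisible by the squeeze height h"
    return waveform.reshape(T // h, h).T

x = np.arange(16.0)   # toy "waveform" with T = 16 samples
X = squeeze(x, h=4)   # 2-D view of shape (4, 4)
print(X.shape)        # (4, 4); column j holds samples 4j .. 4j+3
```

With h = 1 the model degenerates to a fully autoregressive factorization over time (the WaveNet special case mentioned in the abstract); larger h trades sequential steps for parallelism, which is how only a few sequential steps suffice for very long waveforms.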

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-ping20a,
  title     = {{W}ave{F}low: A Compact Flow-based Model for Raw Audio},
  author    = {Ping, Wei and Peng, Kainan and Zhao, Kexin and Song, Zhao},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {7706--7716},
  year      = {2020},
  editor    = {III, Hal Daumé and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/ping20a/ping20a.pdf},
  url       = {http://proceedings.mlr.press/v119/ping20a.html},
  abstract  = {In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood. It handles the long-range structure of 1-D waveform with a dilated 2-D convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases. It generates high-fidelity speech as WaveNet, while synthesizing several orders of magnitude faster as it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps. Furthermore, it can significantly reduce the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Finally, our small-footprint WaveFlow has only 5.91M parameters, which is 15{\texttimes} smaller than WaveGlow. It can generate 22.05 kHz high-fidelity audio 42.6{\texttimes} faster than real-time (at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels.}
}
Endnote
%0 Conference Paper %T WaveFlow: A Compact Flow-based Model for Raw Audio %A Wei Ping %A Kainan Peng %A Kexin Zhao %A Zhao Song %B Proceedings of the 37th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2020 %E Hal Daumé III %E Aarti Singh %F pmlr-v119-ping20a %I PMLR %P 7706--7716 %U http://proceedings.mlr.press/v119/ping20a.html %V 119 %X In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood. It handles the long-range structure of 1-D waveform with a dilated 2-D convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases. It generates high-fidelity speech as WaveNet, while synthesizing several orders of magnitude faster as it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps. Furthermore, it can significantly reduce the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Finally, our small-footprint WaveFlow has only 5.91M parameters, which is 15× smaller than WaveGlow. It can generate 22.05 kHz high-fidelity audio 42.6× faster than real-time (at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels.
APA
Ping, W., Peng, K., Zhao, K. & Song, Z. (2020). WaveFlow: A Compact Flow-based Model for Raw Audio. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:7706-7716. Available from http://proceedings.mlr.press/v119/ping20a.html.