Efficient Neural Audio Synthesis

Nal Kalchbrenner; Erich Elsen; Karen Simonyan; Seb Noury; Norman Casagrande; Edward Lockhart; Florian Stimberg; Aaron Oord; Sander Dieleman; Koray Kavukcuoglu

Efficient Neural Audio Synthesis

Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron Oord, Sander Dieleman, Koray Kavukcuoglu

Proceedings of the 35th International Conference on Machine Learning, PMLR 80:2410-2419, 2018.

Abstract

Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating desired samples. Efficient sampling for this class of models at the cost of little to no loss in quality has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24 kHz 16-bit audio 4 times faster than real time on a GPU. Secondly, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and this relationship holds past sparsity levels of more than 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile phone CPU in real time. Finally, we describe a new dependency scheme for sampling that lets us trade a constant number of non-local, distant dependencies for the ability to generate samples in batches. The Batch WaveRNN produces 8 samples per step without loss of quality and offers orthogonal ways of further increasing sampling efficiency.

Cite this Paper

BibTeX

@InProceedings{pmlr-v80-kalchbrenner18a,
  title = 	 {Efficient Neural Audio Synthesis},
  author =       {Kalchbrenner, Nal and Elsen, Erich and Simonyan, Karen and Noury, Seb and Casagrande, Norman and Lockhart, Edward and Stimberg, Florian and van den Oord, Aaron and Dieleman, Sander and Kavukcuoglu, Koray},
  booktitle = 	 {Proceedings of the 35th International Conference on Machine Learning},
  pages = 	 {2410--2419},
  year = 	 {2018},
  editor = 	 {Dy, Jennifer and Krause, Andreas},
  volume = 	 {80},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {10--15 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v80/kalchbrenner18a/kalchbrenner18a.pdf},
  url = 	 {https://proceedings.mlr.press/v80/kalchbrenner18a.html},
  abstract = 	 {Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating desired samples. Efficient sampling for this class of models at the cost of little to no loss in quality has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24 kHz 16-bit audio 4 times faster than real time on a GPU. Secondly, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and this relationship holds past sparsity levels of more than 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile phone CPU in real time. Finally, we describe a new dependency scheme for sampling that lets us trade a constant number of non-local, distant dependencies for the ability to generate samples in batches. The Batch WaveRNN produces 8 samples per step without loss of quality and offers orthogonal ways of further increasing sampling efficiency.}
}

Endnote

%0 Conference Paper
%T Efficient Neural Audio Synthesis
%A Nal Kalchbrenner
%A Erich Elsen
%A Karen Simonyan
%A Seb Noury
%A Norman Casagrande
%A Edward Lockhart
%A Florian Stimberg
%A Aaron Oord
%A Sander Dieleman
%A Koray Kavukcuoglu
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause	
%F pmlr-v80-kalchbrenner18a
%I PMLR
%P 2410--2419
%U https://proceedings.mlr.press/v80/kalchbrenner18a.html
%V 80
%X Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating desired samples. Efficient sampling for this class of models at the cost of little to no loss in quality has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24 kHz 16-bit audio 4 times faster than real time on a GPU. Secondly, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and this relationship holds past sparsity levels of more than 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile phone CPU in real time. Finally, we describe a new dependency scheme for sampling that lets us trade a constant number of non-local, distant dependencies for the ability to generate samples in batches. The Batch WaveRNN produces 8 samples per step without loss of quality and offers orthogonal ways of further increasing sampling efficiency.

APA

Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., Oord, A., Dieleman, S. & Kavukcuoglu, K.. (2018). Efficient Neural Audio Synthesis. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:2410-2419 Available from https://proceedings.mlr.press/v80/kalchbrenner18a.html.

Related Material

Download PDF