Fast Timing-Conditioned Latent Audio Diffusion

Zach Evans, Cj Carr, Josiah Taylor, Scott H. Hawley, Jordi Pons
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:12652-12665, 2024.

Abstract

Generating long-form 44.1kHz stereo audio from text prompts can be computationally demanding. Moreover, most previous works do not address the fact that music and sound effects naturally vary in duration. Our research focuses on the efficient generation of long-form, variable-length stereo music and sounds at 44.1kHz from text prompts with a generative model. It is based on latent diffusion, with its latent defined by a fully-convolutional variational autoencoder. The generative model is conditioned on text prompts as well as timing embeddings, allowing fine control over both the content and the length of the generated music and sounds. It can render stereo signals of up to 95 sec at 44.1kHz in 8 sec on an A100 GPU. Despite its compute efficiency and fast inference, the proposed model is among the best on two public text-to-music and text-to-audio benchmarks and, unlike state-of-the-art models, can generate music with structure and stereo sounds.
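To picture the timing conditioning the abstract describes, here is a minimal, hypothetical PyTorch sketch, not the authors' implementation: it assumes learned per-second embedding tables (the names TimingConditioner, start_embed, total_embed, max_seconds, and dim are illustrative) that map a chunk's start time and the file's total duration to extra tokens appended to the text features for cross-attention.

import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Hypothetical sketch of timing conditioning: the chunk's start time
    and the file's total length (in whole seconds) are mapped to learned
    embeddings and appended to the text features, so the diffusion model
    can attend to both content and length information."""

    def __init__(self, max_seconds: int = 512, dim: int = 768):
        super().__init__()
        # One learned embedding per discrete second value (assumed here).
        self.start_embed = nn.Embedding(max_seconds, dim)
        self.total_embed = nn.Embedding(max_seconds, dim)

    def forward(self, text_feats: torch.Tensor,
                seconds_start: torch.Tensor,
                seconds_total: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, seq_len, dim) from a frozen text encoder.
        start = self.start_embed(seconds_start).unsqueeze(1)  # (batch, 1, dim)
        total = self.total_embed(seconds_total).unsqueeze(1)  # (batch, 1, dim)
        # Concatenate along the sequence axis; the diffusion backbone then
        # cross-attends over text tokens plus the two timing tokens.
        return torch.cat([text_feats, start, total], dim=1)

# Usage: condition on a window starting at 0 s of a 95 s target signal.
cond = TimingConditioner()
text = torch.randn(2, 77, 768)
out = cond(text, torch.tensor([0, 0]), torch.tensor([95, 95]))
print(out.shape)  # torch.Size([2, 79, 768])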

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-evans24a,
  title     = {Fast Timing-Conditioned Latent Audio Diffusion},
  author    = {Evans, Zach and Carr, Cj and Taylor, Josiah and Hawley, Scott H. and Pons, Jordi},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {12652--12665},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/evans24a/evans24a.pdf},
  url       = {https://proceedings.mlr.press/v235/evans24a.html}
}
Endnote
%0 Conference Paper
%T Fast Timing-Conditioned Latent Audio Diffusion
%A Zach Evans
%A Cj Carr
%A Josiah Taylor
%A Scott H. Hawley
%A Jordi Pons
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-evans24a
%I PMLR
%P 12652--12665
%U https://proceedings.mlr.press/v235/evans24a.html
%V 235
APA
Evans, Z., Carr, C., Taylor, J., Hawley, S.H. & Pons, J. (2024). Fast Timing-Conditioned Latent Audio Diffusion. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:12652-12665. Available from https://proceedings.mlr.press/v235/evans24a.html.