DITTO: Diffusion Inference-Time T-Optimization for Music Generation

Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:38426-38447, 2024.

Abstract

We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose framework for controlling pre-trained text-to-music diffusion models at inference time by optimizing initial noise latents. Our method can optimize through any differentiable feature-matching loss to achieve a target (stylized) output and leverages gradient checkpointing for memory efficiency. We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control – all without ever fine-tuning the underlying model. When we compare our approach against related training-, guidance-, and optimization-based methods, we find DITTO achieves state-of-the-art performance on nearly all tasks, outperforming comparable approaches on controllability, audio quality, and computational efficiency, thus opening the door to high-quality, flexible, training-free control of diffusion models. Sound examples can be found at https://ditto-music.github.io/web/.
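To make the core idea concrete, below is a minimal PyTorch sketch of inference-time optimization of the initial noise latent: the latent is the optimization variable, a feature-matching loss is differentiated through the full sampling chain, and gradient checkpointing keeps memory roughly constant in the number of sampling steps. The denoiser, sampler step, and feature extractor here are toy stand-ins for illustration, not the paper's actual models or API.

import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

# Toy stand-ins; DITTO uses a pre-trained text-to-music diffusion
# model and task-specific differentiable feature extractors.
denoiser = torch.nn.Linear(16, 16)             # stands in for the diffusion model
for p in denoiser.parameters():                # model stays frozen: no fine-tuning
    p.requires_grad_(False)

def sampler_step(x, t):                        # one (fake) reverse-diffusion step
    return x - 0.02 * denoiser(x)

def feature_fn(x):                             # any differentiable feature extractor
    return x.mean(dim=-1)

num_steps, opt_steps = 20, 50
target_feats = torch.zeros(1)                  # target (stylized) feature values
x_T = torch.randn(1, 16, requires_grad=True)   # initial noise latent being optimized
opt = torch.optim.Adam([x_T], lr=1e-2)

for _ in range(opt_steps):
    x = x_T
    for t in reversed(range(num_steps)):
        # Gradient checkpointing: recompute each step's activations in the
        # backward pass instead of storing them, so memory stays roughly
        # constant in the number of sampling steps.
        x = checkpoint(sampler_step, x, t, use_reentrant=False)
    loss = F.mse_loss(feature_fn(x), target_feats)  # feature-matching loss
    opt.zero_grad()
    loss.backward()
    opt.step()

Swapping in a different feature_fn and loss (e.g., melody, intensity, or structure features) yields the different control tasks described in the abstract, with no change to the underlying model.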

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-novack24a,
  title     = {{DITTO}: Diffusion Inference-Time T-Optimization for Music Generation},
  author    = {Novack, Zachary and McAuley, Julian and Berg-Kirkpatrick, Taylor and Bryan, Nicholas J.},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {38426--38447},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/novack24a/novack24a.pdf},
  url       = {https://proceedings.mlr.press/v235/novack24a.html}
}
Endnote
%0 Conference Paper
%T DITTO: Diffusion Inference-Time T-Optimization for Music Generation
%A Zachary Novack
%A Julian McAuley
%A Taylor Berg-Kirkpatrick
%A Nicholas J. Bryan
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-novack24a
%I PMLR
%P 38426--38447
%U https://proceedings.mlr.press/v235/novack24a.html
%V 235
APA
Novack, Z., McAuley, J., Berg-Kirkpatrick, T. & Bryan, N.J. (2024). DITTO: Diffusion Inference-Time T-Optimization for Music Generation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:38426-38447. Available from https://proceedings.mlr.press/v235/novack24a.html.
