LLark: A Multimodal Instruction-Following Language Model for Music

Joshua P Gardner, Simon Durand, Daniel Stoller, Rachel M Bittner
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:15037-15082, 2024.

Abstract

Music has a unique and complex structure that is challenging for both expert humans and existing AI systems to understand, and it presents challenges distinct from other forms of audio. We present LLark, an instruction-tuned multimodal model for music understanding. We detail our process for dataset creation, which involves augmenting the annotations of diverse open-source music datasets and converting them to a unified instruction-tuning format. We propose a multimodal architecture for LLark, integrating a pretrained generative model for music with a pretrained language model. In evaluations on three types of tasks (music understanding, captioning, and reasoning), we show that LLark matches or outperforms existing baselines in music understanding, and that humans show a high degree of agreement with its responses in captioning and reasoning tasks. LLark is trained entirely from open-source music data and models, and we make our training code available along with the release of this paper. Additional results and audio examples are at https://bit.ly/llark, and our source code is available at https://github.com/spotify-research/llark.
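The abstract describes the architecture only at a high level: features from a pretrained generative music model are combined with a pretrained language model. The snippet below is a minimal illustrative sketch of one common way to wire such a system together, in which a small learned projection maps frame-level audio features into the language model's embedding space and the projected "audio tokens" are concatenated with the embedded instruction text. It is not the authors' implementation; the module names, dimensions, and PyTorch framing are assumptions for illustration.

# Illustrative sketch only -- not the LLark implementation. It assumes
# frame-level features from a frozen, pretrained music model and a language
# model that accepts precomputed input embeddings (HuggingFace-style
# `inputs_embeds`). Dimensions below are arbitrary toy values.

import torch
import torch.nn as nn


class AudioProjection(nn.Module):
    """Maps audio-model features into the language model's embedding space."""

    def __init__(self, audio_dim: int, lm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, num_frames, audio_dim) -> (batch, num_frames, lm_dim)
        return self.net(audio_feats)


def build_inputs_embeds(audio_feats, instruction_ids, projector, token_embedding):
    """Prepend projected audio 'tokens' to the embedded instruction text.

    The concatenated sequence would be passed to the language model as
    `inputs_embeds`, so the model attends jointly to audio and text.
    """
    audio_tokens = projector(audio_feats)           # (batch, T_audio, lm_dim)
    text_embeds = token_embedding(instruction_ids)  # (batch, T_text, lm_dim)
    return torch.cat([audio_tokens, text_embeds], dim=1)


if __name__ == "__main__":
    audio_dim, lm_dim, vocab_size = 512, 4096, 32000     # toy sizes
    projector = AudioProjection(audio_dim, lm_dim)
    token_embedding = nn.Embedding(vocab_size, lm_dim)   # stand-in for the LM's embedding table

    audio_feats = torch.randn(2, 25, audio_dim)          # e.g. 25 pooled audio frames
    instruction_ids = torch.randint(0, vocab_size, (2, 24))
    inputs_embeds = build_inputs_embeds(audio_feats, instruction_ids, projector, token_embedding)
    print(inputs_embeds.shape)  # torch.Size([2, 49, 4096])

In a setup like this, typically only the projection module (and possibly the language model) would be updated during instruction tuning, while the audio feature extractor stays frozen.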

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-gardner24a,
  title     = {{LL}ark: A Multimodal Instruction-Following Language Model for Music},
  author    = {Gardner, Joshua P and Durand, Simon and Stoller, Daniel and Bittner, Rachel M},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {15037--15082},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/gardner24a/gardner24a.pdf},
  url       = {https://proceedings.mlr.press/v235/gardner24a.html}
}
Endnote
%0 Conference Paper
%T LLark: A Multimodal Instruction-Following Language Model for Music
%A Joshua P Gardner
%A Simon Durand
%A Daniel Stoller
%A Rachel M Bittner
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-gardner24a
%I PMLR
%P 15037--15082
%U https://proceedings.mlr.press/v235/gardner24a.html
%V 235
APA
Gardner, J.P., Durand, S., Stoller, D. & Bittner, R.M. (2024). LLark: A Multimodal Instruction-Following Language Model for Music. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:15037-15082. Available from https://proceedings.mlr.press/v235/gardner24a.html.
