UniAudio: Towards Universal Audio Generation with Large Language Models

Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Haohan Guo, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Zhou Zhao, Xixin Wu, Helen M. Meng
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:56422-56447, 2024.

Abstract

Audio generation is a major branch of generative AI research. Compared with prior works in this area, which are commonly task-specific and rely on heavy domain knowledge, this paper advocates building universal audio generation models that can handle various tasks in a unified manner. As recent research on large language models (LLMs) has demonstrated their strong ability to handle multiple tasks, this work presents UniAudio, an LLM-based audio generation model that supports a wide range of audio generation tasks. Based on various input conditions, such as phoneme, text description, or audio itself, UniAudio can generate speech, sound, music, and singing voice. The proposed UniAudio is built with 100k hours of multi-source, openly available audio data and is scaled to 1B parameters. The audio tokenization method and language model architecture are also specifically designed for both performance and efficiency. Experimentally, UniAudio supports 11 audio generation tasks and consistently achieves competitive results on all of them. We also show that UniAudio can support new tasks seamlessly via simple fine-tuning.
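The abstract's core recipe, casting every audio generation task as conditional next-token prediction over one shared token stream, can be made concrete with a minimal sketch. Everything below (the task tokens, the separator, the vocabulary size, and the tiny Transformer standing in for the 1B-parameter model) is an illustrative assumption, not the authors' implementation:

# Illustrative sketch only: one flat sequence of the form
#   [task id] + condition tokens + <sep> + target audio tokens,
# trained with next-token prediction. All token ids and the toy
# model are assumptions made for illustration.
import torch
import torch.nn as nn

VOCAB = 1024                                         # assumed shared codec/text vocabulary
TASK_IDS = {"tts": 1000, "sing": 1001, "se": 1002}   # hypothetical task tokens
SEP = 1003                                           # hypothetical separator token

def build_sequence(task, cond_tokens, audio_tokens):
    """Flatten task id, condition tokens, and target audio tokens into one sequence."""
    return torch.tensor([TASK_IDS[task]] + cond_tokens + [SEP] + audio_tokens)

class TinyAudioLM(nn.Module):
    """A toy decoder-only LM standing in for the 1B-parameter UniAudio model."""
    def __init__(self, vocab=VOCAB + 8, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):  # x: (batch, time)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.body(self.emb(x), mask=mask))

# Usage: one next-token-prediction training step on a fabricated example.
seq = build_sequence("tts", cond_tokens=[7, 8, 9], audio_tokens=[400, 401, 402]).unsqueeze(0)
model = TinyAudioLM()
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))
loss.backward()
print(f"toy loss: {loss.item():.3f}")

In practice the cross-entropy loss is usually restricted to the target-token span; the toy above trains on the full sequence for brevity.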

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-yang24x,
  title     = {{U}ni{A}udio: Towards Universal Audio Generation with Large Language Models},
  author    = {Yang, Dongchao and Tian, Jinchuan and Tan, Xu and Huang, Rongjie and Liu, Songxiang and Guo, Haohan and Chang, Xuankai and Shi, Jiatong and Zhao, Sheng and Bian, Jiang and Zhao, Zhou and Wu, Xixin and Meng, Helen M.},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {56422--56447},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/yang24x/yang24x.pdf},
  url       = {https://proceedings.mlr.press/v235/yang24x.html},
  abstract  = {Audio generation is a major branch of generative AI research. Compared with prior works in this area, which are commonly task-specific and rely on heavy domain knowledge, this paper advocates building universal audio generation models that can handle various tasks in a unified manner. As recent research on large language models (LLMs) has demonstrated their strong ability to handle multiple tasks, this work presents UniAudio, an LLM-based audio generation model that supports a wide range of audio generation tasks. Based on various input conditions, such as phoneme, text description, or audio itself, UniAudio can generate speech, sound, music, and singing voice. The proposed UniAudio is built with 100k hours of multi-source, openly available audio data and is scaled to 1B parameters. The audio tokenization method and language model architecture are also specifically designed for both performance and efficiency. Experimentally, UniAudio supports 11 audio generation tasks and consistently achieves competitive results on all of them. We also show that UniAudio can support new tasks seamlessly via simple fine-tuning.}
}
Endnote
%0 Conference Paper
%T UniAudio: Towards Universal Audio Generation with Large Language Models
%A Dongchao Yang
%A Jinchuan Tian
%A Xu Tan
%A Rongjie Huang
%A Songxiang Liu
%A Haohan Guo
%A Xuankai Chang
%A Jiatong Shi
%A Sheng Zhao
%A Jiang Bian
%A Zhou Zhao
%A Xixin Wu
%A Helen M. Meng
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-yang24x
%I PMLR
%P 56422--56447
%U https://proceedings.mlr.press/v235/yang24x.html
%V 235
%X Audio generation is a major branch of generative AI research. Compared with prior works in this area, which are commonly task-specific and rely on heavy domain knowledge, this paper advocates building universal audio generation models that can handle various tasks in a unified manner. As recent research on large language models (LLMs) has demonstrated their strong ability to handle multiple tasks, this work presents UniAudio, an LLM-based audio generation model that supports a wide range of audio generation tasks. Based on various input conditions, such as phoneme, text description, or audio itself, UniAudio can generate speech, sound, music, and singing voice. The proposed UniAudio is built with 100k hours of multi-source, openly available audio data and is scaled to 1B parameters. The audio tokenization method and language model architecture are also specifically designed for both performance and efficiency. Experimentally, UniAudio supports 11 audio generation tasks and consistently achieves competitive results on all of them. We also show that UniAudio can support new tasks seamlessly via simple fine-tuning.
APA
Yang, D., Tian, J., Tan, X., Huang, R., Liu, S., Guo, H., Chang, X., Shi, J., Zhao, S., Bian, J., Zhao, Z., Wu, X. & Meng, H.M. (2024). UniAudio: Towards Universal Audio Generation with Large Language Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:56422-56447. Available from https://proceedings.mlr.press/v235/yang24x.html.