Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction

Shu-Wen Yang, Byeonggeun Kim, Kuan-Po Huang, Qingming Tang, Huy Phan, Bo-Ru Lu, Harshavardhan Sundar, Shalini Ghosh, Hung-Yi Lee, Chieh-Chi Kao, Chao Wang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:70793-70812, 2025.

Abstract

Autoregressive next-token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges due to its inherently continuous nature. We study audio generation with a causal language model (LM) without discrete tokens. We leverage token-wise diffusion to model the continuous distribution of the next continuous-valued token. Our approach delivers significant improvements over the previous discrete solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Fréchet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. Additionally, we propose a novel masked next-token prediction task that incorporates masked prediction into the causal LM framework. On AudioCaps, this innovation yields 41% and 33% relative FAD improvements over the AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with state-of-the-art (SOTA) diffusion models. Furthermore, we achieve these results with significantly fewer parameters: 193M for our Base and 462M for our Large models.
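To make the modeling idea concrete, below is a minimal PyTorch sketch of a causal Transformer whose per-position hidden state conditions a small token-wise diffusion head over the next continuous-valued token, trained with a standard noise-prediction loss. All module names, dimensions, and the cosine noise schedule are illustrative assumptions for exposition only, not the authors' implementation; the continuous tokens stand in for latents from a neural audio codec.

```python
# Hedged sketch: a causal LM over continuous-valued tokens. Instead of a softmax over a
# discrete codebook, each hidden state conditions a small diffusion head that models the
# distribution of the next continuous token. Sizes and schedule are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiffusionHead(nn.Module):
    """Predicts the noise added to a continuous token, conditioned on the LM state."""

    def __init__(self, token_dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim + cond_dim + 1, hidden),  # +1 for the diffusion timestep
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, noisy_token, cond, t):
        # noisy_token: (..., token_dim), cond: (..., cond_dim), t: (..., 1) in [0, 1]
        return self.net(torch.cat([noisy_token, cond, t], dim=-1))


class ContinuousTokenLM(nn.Module):
    """Causal Transformer backbone + token-wise diffusion head (illustrative)."""

    def __init__(self, token_dim: int = 8, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(token_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=8, dim_feedforward=4 * d_model, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = DiffusionHead(token_dim, d_model)

    def forward(self, tokens):
        # tokens: (B, T, token_dim) continuous latents, e.g. from a neural audio codec
        T = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        return self.backbone(self.in_proj(tokens), mask=causal)  # (B, T, d_model)

    def diffusion_loss(self, tokens):
        # Each position's hidden state conditions the diffusion over the *next* token.
        h = self.forward(tokens)[:, :-1]       # states for positions 0 .. T-2
        target = tokens[:, 1:]                 # next tokens at positions 1 .. T-1
        t = torch.rand(target.shape[:-1] + (1,), device=tokens.device)
        noise = torch.randn_like(target)
        alpha = torch.cos(0.5 * torch.pi * t)  # simple cosine schedule (assumed)
        sigma = torch.sin(0.5 * torch.pi * t)
        noisy = alpha * target + sigma * noise
        pred = self.head(noisy, h, t)          # predict the injected noise
        return F.mse_loss(pred, noise)


if __name__ == "__main__":
    model = ContinuousTokenLM()
    fake_latents = torch.randn(2, 50, 8)       # stand-in for codec latents
    print(model.diffusion_loss(fake_latents).item())
```

At inference, one would iteratively denoise a sample for the next position under the same conditioning and append it to the context, mirroring standard autoregressive decoding; the paper's masked next-token prediction additionally masks parts of the input context during training, which is not shown in this sketch.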

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-yang25n,
  title     = {Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction},
  author    = {Yang, Shu-Wen and Kim, Byeonggeun and Huang, Kuan-Po and Tang, Qingming and Phan, Huy and Lu, Bo-Ru and Sundar, Harshavardhan and Ghosh, Shalini and Lee, Hung-Yi and Kao, Chieh-Chi and Wang, Chao},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {70793--70812},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/yang25n/yang25n.pdf},
  url       = {https://proceedings.mlr.press/v267/yang25n.html},
  abstract  = {Autoregressive next-token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges due to its inherently continuous nature. We research audio generation with a causal language model (LM) without discrete tokens. We leverage token-wise diffusion to model the continuous distribution of the next continuous-valued token. Our approach delivers significant improvements over previous discrete solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Frechet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. Additionally, we propose a novel masked next-token prediction task that incorporates masked prediction into the causal LM framework. On AudioCaps, the innovation yields 41% and 33% relative FAD improvements over AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with the state-of-the-art (SOTA) diffusion models. Furthermore, we achieve these results with significantly fewer parameters—193M for our Base and 462M for our Large models.}
}
Endnote
%0 Conference Paper
%T Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction
%A Shu-Wen Yang
%A Byeonggeun Kim
%A Kuan-Po Huang
%A Qingming Tang
%A Huy Phan
%A Bo-Ru Lu
%A Harshavardhan Sundar
%A Shalini Ghosh
%A Hung-Yi Lee
%A Chieh-Chi Kao
%A Chao Wang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-yang25n
%I PMLR
%P 70793--70812
%U https://proceedings.mlr.press/v267/yang25n.html
%V 267
%X Autoregressive next-token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges due to its inherently continuous nature. We research audio generation with a causal language model (LM) without discrete tokens. We leverage token-wise diffusion to model the continuous distribution of the next continuous-valued token. Our approach delivers significant improvements over previous discrete solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Frechet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. Additionally, we propose a novel masked next-token prediction task that incorporates masked prediction into the causal LM framework. On AudioCaps, the innovation yields 41% and 33% relative FAD improvements over AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with the state-of-the-art (SOTA) diffusion models. Furthermore, we achieve these results with significantly fewer parameters—193M for our Base and 462M for our Large models.
APA
Yang, S., Kim, B., Huang, K., Tang, Q., Phan, H., Lu, B., Sundar, H., Ghosh, S., Lee, H., Kao, C. & Wang, C. (2025). Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:70793-70812. Available from https://proceedings.mlr.press/v267/yang25n.html.
