A Variational Framework for Improving Naturalness in Generative Spoken Language Models

Li-Wei Chen, Takuya Higuchi, Zakaria Aldeneh, Ahmed Hussen Abdelaziz, Alexander Rudnicky
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:8226-8244, 2025.

Abstract

The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, it produces preferred speech continuations according to human raters. Code, samples and models are available at https://github.com/b04901014/vae-gslm.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chen25z, title = {A Variational Framework for Improving Naturalness in Generative Spoken Language Models}, author = {Chen, Li-Wei and Higuchi, Takuya and Aldeneh, Zakaria and Abdelaziz, Ahmed Hussen and Rudnicky, Alexander}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {8226--8244}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chen25z/chen25z.pdf}, url = {https://proceedings.mlr.press/v267/chen25z.html}, abstract = {The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, it produces preferred speech continuations according to human raters. Code, samples and models are available at https://github.com/b04901014/vae-gslm.} }
Endnote
%0 Conference Paper %T A Variational Framework for Improving Naturalness in Generative Spoken Language Models %A Li-Wei Chen %A Takuya Higuchi %A Zakaria Aldeneh %A Ahmed Hussen Abdelaziz %A Alexander Rudnicky %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-chen25z %I PMLR %P 8226--8244 %U https://proceedings.mlr.press/v267/chen25z.html %V 267 %X The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, it produces preferred speech continuations according to human raters. Code, samples and models are available at https://github.com/b04901014/vae-gslm.
APA
Chen, L., Higuchi, T., Aldeneh, Z., Abdelaziz, A.H. & Rudnicky, A.. (2025). A Variational Framework for Improving Naturalness in Generative Spoken Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:8226-8244 Available from https://proceedings.mlr.press/v267/chen25z.html.

Related Material