NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction

Qichao Wang, Ziqiao Meng, Wenqian Cui, Yifei Zhang, Pengcheng Wu, Bingzhe Wu, Irwin King, Liang Chen, Peilin Zhao
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:63912-63928, 2025.

Abstract

Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech language models (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures the structure and dynamics of human conversation. In this work, we systematically explore the use of dual-channel speech data in the context of modern large language models, and introduce a novel generative modeling paradigm—Next-Token-Pair Prediction (NTPP)—to enable speaker-independent dual-channel spoken dialogue learning using decoder-only architectures for the first time. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness. Moreover, compared to existing methods, NTPP achieves substantially lower inference latency, highlighting its practical efficiency for real-time applications. Demo and code can be found at https://audio-3059.pages.dev.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-wang25by, title = {{NTPP}: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction}, author = {Wang, Qichao and Meng, Ziqiao and Cui, Wenqian and Zhang, Yifei and Wu, Pengcheng and Wu, Bingzhe and King, Irwin and Chen, Liang and Zhao, Peilin}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {63912--63928}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/wang25by/wang25by.pdf}, url = {https://proceedings.mlr.press/v267/wang25by.html}, abstract = {Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech language models (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures the structure and dynamics of human conversation. In this work, we systematically explore the use of dual-channel speech data in the context of modern large language models, and introduce a novel generative modeling paradigm—Next-Token-Pair Prediction (NTPP)—to enable speaker-independent dual-channel spoken dialogue learning using decoder-only architectures for the first time. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness. Moreover, compared to existing methods, NTPP achieves substantially lower inference latency, highlighting its practical efficiency for real-time applications. Demo and code can be found at https://audio-3059.pages.dev.} }
Endnote
%0 Conference Paper %T NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction %A Qichao Wang %A Ziqiao Meng %A Wenqian Cui %A Yifei Zhang %A Pengcheng Wu %A Bingzhe Wu %A Irwin King %A Liang Chen %A Peilin Zhao %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-wang25by %I PMLR %P 63912--63928 %U https://proceedings.mlr.press/v267/wang25by.html %V 267 %X Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech language models (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures the structure and dynamics of human conversation. In this work, we systematically explore the use of dual-channel speech data in the context of modern large language models, and introduce a novel generative modeling paradigm—Next-Token-Pair Prediction (NTPP)—to enable speaker-independent dual-channel spoken dialogue learning using decoder-only architectures for the first time. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness. Moreover, compared to existing methods, NTPP achieves substantially lower inference latency, highlighting its practical efficiency for real-time applications. Demo and code can be found at https://audio-3059.pages.dev.
APA
Wang, Q., Meng, Z., Cui, W., Zhang, Y., Wu, P., Wu, B., King, I., Chen, L. & Zhao, P.. (2025). NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:63912-63928 Available from https://proceedings.mlr.press/v267/wang25by.html.

Related Material